
Think Like a Data Scientist. Tackle the data science process step-by-step PDF

331 Pages·2017·5.21 MB·English

Preview Think Like a Data Scientist. Tackle the data science process step-by-step

Tackle the data science process step-by-step
Brian Godsey
MANNING

[Figure: The lifecycle of a data science project. Data science process phases: Prepare, Build, Finish. Steps shown: Set goals, Explore, Wrangle, Assess, Plan, Analyze, Engineer, Optimize, Execute, Deliver, Revise, Wrap up.]

This book is organized around the three phases of a data science project:
- The first phase is preparation: time and effort spent gathering information at the beginning of a project can spare big headaches later.
- The second phase is building the product, from planning through execution, using what you learned during the preparation phase and all the tools that statistics and software can provide.
- The third and final phase is finishing: delivering the product, getting feedback, making revisions, supporting the product, and wrapping up the project.

Think Like a Data Scientist
TACKLE THE DATA SCIENCE PROCESS STEP-BY-STEP
BRIAN GODSEY
MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]

© 2017 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Karen Miller
Review editor: Aleksandar Dragosavljević
Technical development editor: Mike Shepard
Project editor: Kevin Sullivan
Copy editor: Linda Recktenwald
Proofreader: Corbin Collins
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781633430273
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 22 21 20 19 18 17

To all thoughtful, deliberate problem-solvers who consider themselves scientists first and builders second

For everyone everywhere who ever taught me anything

brief contents

PART 1 PREPARING AND GATHERING DATA AND KNOWLEDGE
1 Philosophies of data science
2 Setting goals by asking good questions
3 Data all around us: the virtual wilderness
4 Data wrangling: from capture to domestication
5 Data assessment: poking and prodding

PART 2 BUILDING A PRODUCT WITH SOFTWARE AND STATISTICS
6 Developing a plan
7 Statistics and modeling: concepts and foundations
8 Software: statistics in action
9 Supplementary software: bigger, faster, more efficient
10 Plan execution: putting it all together

PART 3 FINISHING OFF THE PRODUCT AND WRAPPING UP
11 Delivering a product
12 After product delivery: problems and revisions
13 Wrapping up: putting the project away

contents

preface
acknowledgments
about this book
about the cover illustration

PART 1 PREPARING AND GATHERING DATA AND KNOWLEDGE

1 Philosophies of data science
1.1 Data science and this book
1.2 Awareness is valuable
1.3 Developer vs. data scientist
1.4 Do I need to be a software developer?
1.5 Do I need to know statistics?
1.6 Priorities: knowledge first, technology second, opinions third
1.7 Best practices
    Documentation
    Code repositories and versioning
    Code organization
    Ask questions
    Stay close to the data
1.8 Reading this book: how I discuss concepts

2 Setting goals by asking good questions
2.1 Listening to the customer
    Resolving wishes and pragmatism
    The customer is probably not a data scientist
    Asking specific questions to uncover fact, not opinions
    Suggesting deliverables: guess and check
    Iterate your ideas based on knowledge, not wishes
2.2 Ask good questions - of the data
    Good questions are concrete in their assumptions
    Good answers: measurable success without too much cost
2.3 Answering the question using data
    Is the data relevant and sufficient?
    Has someone done this before?
    Figuring out what data and software you could use
    Anticipate obstacles to getting everything you want
2.4 Setting goals
    What is possible?
    What is valuable?
    What is efficient?
2.5 Planning: be flexible

3 Data all around us: the virtual wilderness
3.1 Data as the object of study
    The users of computers and the internet became data generators
    Data for its own sake
    Data scientist as explorer
3.2 Where data might live, and how to interact with it
    Flat files
    HTML
    XML
    JSON
    Relational databases
    Non-relational databases
    APIs
    Common bad formats
    Unusual formats
    Deciding which format to use
3.3 Scouting for data
    First step: Google search
    Copyright and licensing
    The data you have: is it enough?
    Combining data sources
    Web scraping
    Measuring or collecting things yourself
3.4 Example: microRNA and gene expression

4 Data wrangling: from capture to domestication
4.1 Case study: best all-time performances in track and field
    Common heuristic comparisons
    IAAF Scoring Tables
    Comparing performances using all data available
