Data Science

by Lillian Pierson

Foreword by Jake Porway
Founder and Executive Director of DataKind™

Data Science For Dummies®

Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030‐5774, www.wiley.com

Copyright © 2015 by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877‐762‐2974, outside the U.S. at 317‐572‐3993, or fax 317‐572‐4002. For technical support, please visit www.wiley.com/techsupport.

Wiley publishes in a variety of print and electronic formats and by print‐on‐demand. Some material included with standard print versions of this book may not be included in e‐books or in print‐on‐demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2014955780

ISBN 978‐1‐118‐84155‐6 (pbk); ISBN 978‐1‐118‐84145‐7 (ebk); ISBN 978‐1‐118‐84152‐5

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

Contents at a Glance

Foreword
Introduction
Part I: Getting Started With Data Science
  Chapter 1: Wrapping Your Head around Data Science
  Chapter 2: Exploring Data Engineering Pipelines and Infrastructure
  Chapter 3: Applying Data Science to Business and Industry
Part II: Using Data Science to Extract Meaning from Your Data
  Chapter 4: Introducing Probability and Statistics
  Chapter 5: Clustering and Classification
  Chapter 6: Clustering and Classification with Nearest Neighbor Algorithms
  Chapter 7: Mathematical Modeling in Data Science
  Chapter 8: Modeling Spatial Data with Statistics
Part III: Creating Data Visualizations that Clearly Communicate Meaning
  Chapter 9: Following the Principles of Data Visualization Design
  Chapter 10: Using D3.js for Data Visualization
  Chapter 11: Web-Based Applications for Visualization Design
  Chapter 12: Exploring Best Practices in Dashboard Design
  Chapter 13: Making Maps from Spatial Data
Part IV: Computing for Data Science
  Chapter 14: Using Python for Data Science
  Chapter 15: Using Open Source R for Data Science
  Chapter 16: Using SQL in Data Science
  Chapter 17: Software Applications for Data Science
Part V: Applying Domain Expertise to Solve Real-World Problems Using Data Science
  Chapter 18: Using Data Science in Journalism
  Chapter 19: Delving into Environmental Data Science
  Chapter 20: Data Science for Driving Growth in E-Commerce
  Chapter 21: Using Data Science to Describe and Predict Criminal Activity
Part VI: The Part of Tens
  Chapter 22: Ten Phenomenal Resources for Open Data
  Chapter 23: Ten (or So) Free Data Science Tools and Applications
Index

Table of Contents

Foreword
Introduction
  About This Book
  Foolish Assumptions
  Icons Used in This Book
  Beyond the Book
  Where to Go from Here

Part I: Getting Started With Data Science

  Chapter 1: Wrapping Your Head around Data Science
    Seeing Who Can Make Use of Data Science
    Looking at the Pieces of the Data Science Puzzle
      Collecting, querying, and consuming data
      Making use of math and statistics
      Coding, coding, coding . . . it’s just part of the game
      Applying data science to your subject area
      Communicating data insights
    Getting a Basic Lay of the Data Science Landscape
      Exploring data science solution alternatives
      Identifying the obvious wins

  Chapter 2: Exploring Data Engineering Pipelines and Infrastructure
    Defining Big Data by Its Four Vs
      Grappling with data volume
      Handling data velocity
      Dealing with data variety
      Creating data value
    Identifying Big Data Sources
    Grasping the Difference between Data Science and Data Engineering
      Defining data science
      Defining data engineering
      Comparing data scientists and data engineers
    Boiling Down Data with MapReduce and Hadoop
      Digging into MapReduce
      Understanding Hadoop
    Identifying Alternative Big Data Solutions
      Introducing real-time processing frameworks
      Introducing Massively Parallel Processing (MPP) platforms
      Introducing NoSQL databases
    Data Engineering in Action — A Case Study
      Identifying the business challenge
      Solving business problems with data engineering
      Boasting about benefits

  Chapter 3: Applying Data Science to Business and Industry
    Incorporating Data-Driven Insights into the Business Process
      Benefiting from business-centric data science
      Deploying analytics and data wrangling to convert raw data into actionable insights
      Taking action on business insights
    Distinguishing Business Intelligence and Data Science
      Defining business intelligence
      Defining business-centric data science
      Summarizing the main differences between BI and business-centric data science
    Knowing Who to Call to Get the Job Done Right
    Exploring Data Science in Business: A Data-Driven Business Success Story

Part II: Using Data Science to Extract Meaning from Your Data

  Chapter 4: Introducing Probability and Statistics
    Introducing the Fundamental Concepts of Probability
      Exploring the relationship between probability and inferential statistics
      Understanding random variables, probability distributions, and expectations
      Getting hip to some popular probability distributions
    Introducing Linear Regression
      Getting a handle on simple linear regression models
      Learning to create a fitted regression line
      Ordinary Least Squares regression methods
    Simulations
      Using simulations to assess properties of a test statistic
      Using Monte Carlo simulations to assess properties of an estimator
    Introducing Time Series Analysis
      Understanding patterns in time series
      Modeling univariate time series data

  Chapter 5: Clustering and Classification
    Introducing the Basics of Clustering and Classification
      Getting to know clustering algorithms
      Getting to know classification algorithms
      Getting to know similarity metrics
    Identifying Clusters in Your Data
      Clustering with the k-means algorithm
      Estimating clusters with Kernel Density Estimation
      Clustering with hierarchical and neighborhood algorithms
      Categorizing data with decision tree and random forest algorithms

  Chapter 6: Clustering and Classification with Nearest Neighbor Algorithms
    Making Sense of Data with Nearest Neighbor Analysis
    Seeing the Importance of Clustering and Classification
    Classifying Data with Average Nearest Neighbor Algorithms
      Understanding how the average nearest neighbor algorithm works
    Classifying with K-Nearest Neighbor Algorithms
      Understanding how the k-nearest neighbor algorithm works
      Knowing when to use the k-nearest neighbor algorithm
      Exploring common applications of k-nearest neighbor algorithms
    Using Nearest Neighbor Distances to Infer Meaning from Point Patterns
    Solving Real-World Problems with Nearest Neighbor Algorithms
      Seeing k-nearest neighbor algorithms in action
      Seeing average nearest neighbor algorithms in action

  Chapter 7: Mathematical Modeling in Data Science
    Introducing Multi-Criteria Decision Making (MCDM)
      Understanding multi-criteria analysis by looking at it in action
      Factoring in fuzzy multi-criteria programming
      Knowing when and how to use MCDM
    Using Numerical Methods in Data Science
      Talking about Taylor polynomials
      Bisecting functions with the bisection search algorithm
    Mathematical Modeling with Markov Chains and Stochastic Methods

  Chapter 8: Modeling Spatial Data with Statistics
    Generating Predictive Surfaces from Spatial Point Data
      Understanding the (x, y, z) of spatial data modeling
      Introducing kriging
      Krige for automated kriging interpolations
      Choosing and using models for explicitly-defined kriging interpolations
      Going deeper down the kriging rabbit hole
      Choosing the best-estimation method in kriging
      Analyzing residuals to determine the best-fit model
      Knowing your options in kriging
    Using Trend Surface Analysis on Spatial Data

Part III: Creating Data Visualizations that Clearly Communicate Meaning

  Chapter 9: Following the Principles of Data Visualization Design
    Understanding the Types of Visualizations
      Data storytelling for organizational decision makers
      Data showcasing for analysts
      Designing data art for activists
    Focusing on Your Audience
      Step one: Brainstorming about Brenda
      Step two: Defining your purpose
      Step three: Choosing the most functional visualization type for your purpose
    Picking the Most Appropriate Design Style
      Using design to induce a calculating, exacting response
      Using design to elicit a strong emotional response
    Knowing When to Add Context
      Using data to create context
      Creating context with annotations
      Using graphic elements to create context
    Knowing When to Get Persuasive
    Choosing the Most Appropriate Data Graphic Type
      Exploring the standard chart graphics
      Exploring statistical plots
      Exploring topology structures
      Exploring spatial plots and maps
    Choosing Your Data Graphic
      Scoping out the questions
      Taking users and mediums into account
      Taking a final step back
Description: