Python for Data Science For Dummies PDF

435 Pages·2015·9.72 MB·English

Preview Python for Data Science For Dummies

Python® for Data Science
by Luca Massaron and John Paul Mueller

Python® for Data Science For Dummies®

Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030‐5774, www.wiley.com

Copyright © 2015 by John Wiley & Sons, Inc., Hoboken, New Jersey

Media and software compilation copyright © 2015 by John Wiley & Sons, Inc. All rights reserved.

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. Python is a registered trademark of Python Software Foundation Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877‐762‐2974, outside the U.S. at 317‐572‐3993, or fax 317‐572‐4002. For technical support, please visit www.wiley.com/techsupport.

Wiley publishes in a variety of print and electronic formats and by print‐on‐demand. Some material included with standard print versions of this book may not be included in e‐books or in print‐on‐demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2013956848

ISBN: 978‐1‐118‐84418‐2
ISBN 978-1-118-84398-7 (ebk); ISBN ePDF 978-1-118-84414-4 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

Table of Contents

Introduction
    About This Book
    Foolish Assumptions
    Icons Used in This Book
    Beyond the Book
    Where to Go from Here

Part I: Getting Started with Python for Data Science

    Chapter 1: Discovering the Match between Data Science and Python
        Defining the Sexiest Job of the 21st Century
            Considering the emergence of data science
            Outlining the core competencies of a data scientist
            Linking data science and big data
            Understanding the role of programming
        Creating the Data Science Pipeline
            Preparing the data
            Performing exploratory data analysis
            Learning from data
            Visualizing
            Obtaining insights and data products
        Understanding Python’s Role in Data Science
            Considering the shifting profile of data scientists
            Working with a multipurpose, simple, and efficient language
        Learning to Use Python Fast
            Loading data
            Training a model
            Viewing a result

    Chapter 2: Introducing Python’s Capabilities and Wonders
        Why Python?
            Grasping Python’s core philosophy
            Discovering present and future development goals
        Working with Python
            Getting a taste of the language
            Understanding the need for indentation
            Working at the command line or in the IDE
        Performing Rapid Prototyping and Experimentation
        Considering Speed of Execution
        Visualizing Power
        Using the Python Ecosystem for Data Science
            Accessing scientific tools using SciPy
            Performing fundamental scientific computing using NumPy
            Performing data analysis using pandas
            Implementing machine learning using Scikit‐learn
            Plotting the data using matplotlib
            Parsing HTML documents using Beautiful Soup

    Chapter 3: Setting Up Python for Data Science
        Considering the Off‐the‐Shelf Cross‐Platform Scientific Distributions
            Getting Continuum Analytics Anaconda
            Getting Enthought Canopy Express
            Getting pythonxy
            Getting WinPython
        Installing Anaconda on Windows
        Installing Anaconda on Linux
        Installing Anaconda on Mac OS X
        Downloading the Datasets and Example Code
            Using IPython Notebook
            Defining the code repository
            Understanding the datasets used in this book

    Chapter 4: Reviewing Basic Python
        Working with Numbers and Logic
            Performing variable assignments
            Doing arithmetic
            Comparing data using Boolean expressions
        Creating and Using Strings
        Interacting with Dates
        Creating and Using Functions
            Creating reusable functions
            Calling functions in a variety of ways
        Using Conditional and Loop Statements
            Making decisions using the if statement
            Choosing between multiple options using nested decisions
            Performing repetitive tasks using for
            Using the while statement
        Storing Data Using Sets, Lists, and Tuples
            Performing operations on sets
            Working with lists
            Creating and using Tuples
        Defining Useful Iterators
        Indexing Data Using Dictionaries

Part II: Getting Your Hands Dirty with Data

    Chapter 5: Working with Real Data
        Uploading, Streaming, and Sampling Data
            Uploading small amounts of data into memory
            Streaming large amounts of data into memory
            Sampling data
        Accessing Data in Structured Flat‐File Form
            Reading from a text file
            Reading CSV delimited format
            Reading Excel and other Microsoft Office files
        Sending Data in Unstructured File Form
        Managing Data from Relational Databases
        Interacting with Data from NoSQL Databases
        Accessing Data from the Web

    Chapter 6: Conditioning Your Data
        Juggling between NumPy and pandas
            Knowing when to use NumPy
            Knowing when to use pandas
        Validating Your Data
            Figuring out what’s in your data
            Removing duplicates
            Creating a data map and data plan
        Manipulating Categorical Variables
            Creating categorical variables
            Renaming levels
            Combining levels
        Dealing with Dates in Your Data
            Formatting date and time values
            Using the right time transformation
        Dealing with Missing Data
            Finding the missing data
            Encoding missingness
            Imputing missing data
        Slicing and Dicing: Filtering and Selecting Data
            Slicing rows
            Slicing columns
            Dicing
        Concatenating and Transforming
            Adding new cases and variables
            Removing data
            Sorting and shuffling
        Aggregating Data at Any Level

    Chapter 7: Shaping Data
        Working with HTML Pages
            Parsing XML and HTML
            Using XPath for data extraction
        Working with Raw Text
            Dealing with Unicode
            Stemming and removing stop words
            Introducing regular expressions
        Using the Bag of Words Model and Beyond
            Understanding the bag of words model
            Working with n‐grams
            Implementing TF‐IDF transformations
        Working with Graph Data
            Understanding the adjacency matrix
            Using NetworkX basics

    Chapter 8: Putting What You Know in Action
        Contextualizing Problems and Data
            Evaluating a data science problem
            Researching solutions
            Formulating a hypothesis
            Preparing your data
        Considering the Art of Feature Creation
            Defining feature creation
            Combining variables
            Understanding binning and discretization
            Using indicator variables
            Transforming distributions
        Performing Operations on Arrays
            Using vectorization
            Performing simple arithmetic on vectors and matrices
            Performing matrix vector multiplication
            Performing matrix multiplication

Part III: Visualizing the Invisible

    Chapter 9: Getting a Crash Course in MatPlotLib
        Starting with a Graph
            Defining the plot
            Drawing multiple lines and plots
            Saving your work
        Setting the Axis, Ticks, Grids
            Getting the axes
            Formatting the axes
            Adding grids
        Defining the Line Appearance
            Working with line styles
            Using colors
            Adding markers
        Using Labels, Annotations, and Legends
            Adding labels
            Annotating the chart
            Creating a legend

    Chapter 10: Visualizing the Data
        Choosing the Right Graph
            Showing parts of a whole with pie charts
            Creating comparisons with bar charts
            Showing distributions using histograms
            Depicting groups using box plots
            Seeing data patterns using scatterplots
        Creating Advanced Scatterplots
            Depicting groups
            Showing correlations
        Plotting Time Series
            Representing time on axes
            Plotting trends over time
        Plotting Geographical Data
        Visualizing Graphs
            Developing undirected graphs
            Developing directed graphs

    Chapter 11: Understanding the Tools
        Using the IPython Console
            Interacting with screen text
            Changing the window appearance
            Getting Python help
            Getting IPython help
            Using magic functions
            Discovering objects
        Using IPython Notebook
            Working with styles
            Restarting the kernel
            Restoring a checkpoint
        Performing Multimedia and Graphic Integration
            Embedding plots and other images
            Loading examples from online sites
            Obtaining online graphics and multimedia

Part IV: Wrangling Data

    Chapter 12: Stretching Python’s Capabilities
        Playing with Scikit‐learn
            Understanding classes in Scikit‐learn
            Defining applications for data science
        Performing the Hashing Trick
            Using hash functions
            Demonstrating the hashing trick
            Working with deterministic selection
        Considering Timing and Performance
            Benchmarking with timeit
            Working with the memory profiler
        Running in Parallel
            Performing multicore parallelism
            Demonstrating multiprocessing

    Chapter 13: Exploring Data Analysis
        The EDA Approach
        Defining Descriptive Statistics for Numeric Data
            Measuring central tendency
            Measuring variance and range
            Working with percentiles
            Defining measures of normality
        Counting for Categorical Data
            Understanding frequencies
            Creating contingency tables
        Creating Applied Visualization for EDA
            Inspecting boxplots
            Performing t‐tests after boxplots
            Observing parallel coordinates
            Graphing distributions
            Plotting scatterplots
        Understanding Correlation
            Using covariance and correlation
            Using nonparametric correlation
            Considering chi‐square for tables
        Modifying Data Distributions
            Using the normal distribution
            Creating a Z‐score standardization
            Transforming other notable distributions
