
Python Data Analysis: Perform data collection, data processing, wrangling, visualization, and model building using Python

463 Pages·2021·13.881 MB·English


Python Data Analysis
Third Edition

Perform data collection, data processing, wrangling, visualization, and model building using Python

Avinash Navlani
Armando Fandango
Ivan Idris

BIRMINGHAM - MUMBAI

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh
Publishing Product Manager: Ali Abidi
Content Development Editor: Joseph Sunil
Senior Editor: Roshan Kumar
Technical Editor: Sonam Pandey
Copy Editor: Safis Editing
Project Coordinator: Aishwarya Mohan
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Roshan Kawale

First published: October 2014
Second edition: March 2017
Third Edition: February 2021

Production reference: 1070121

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78995-524-8

www.packt.com

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors

Avinash Navlani has over 8 years of experience working in data science and AI. Currently, he is working as a senior data scientist, improving products and services for customers by using advanced analytics, deploying big data analytical tools, creating and maintaining models, and onboarding compelling new datasets. Previously, he was a university lecturer, where he trained and educated people in data science subjects such as Python for analytics, data mining, machine learning, database management, and NoSQL. Avinash has been involved in research activities in data science and has been a keynote speaker at many conferences in India.

Armando Fandango creates AI-empowered products by leveraging his expertise in deep learning, machine learning, distributed computing, and computational methods and has provided thought leadership roles as the chief data scientist and director at start-ups and large enterprises. He has advised high-tech AI-based start-ups. Armando has authored books such as Python Data Analysis - Second Edition and Mastering TensorFlow, Packt Publishing. He has also published research in international journals and conferences.

Ivan Idris has an MSc in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a Java developer, data warehouse developer, and QA analyst. His main professional interests are business intelligence, big data, and cloud computing. Ivan Idris enjoys writing clean, testable code and interesting technical articles. Ivan Idris is the author of NumPy 1.5 Beginner's Guide and NumPy Cookbook by Packt Publishing. You can find more information and a blog with a few NumPy examples at ivanidris.net.

About the reviewers

Greg Walters has been involved with computers and computer programming since 1972. He is well versed in Visual Basic, Visual Basic .NET, Python, and SQL and is an accomplished user of MySQL, SQLite, Microsoft SQL Server, Oracle, C++, Delphi, Modula-2, Pascal, C, 80x86 Assembler, COBOL, and Fortran. He is a programming trainer and has trained numerous people on many pieces of computer software, including MySQL, Open Database Connectivity, Quattro Pro, Corel Draw!, Paradox, Microsoft Word, Excel, DOS, Windows 3.11, Windows for Workgroups, Windows 95, Windows NT, Windows 2000, Windows XP, and Linux. He is semi-retired and has written over 100 articles for Full Circle Magazine. He is also a musician and loves to cook. He is open to working as a freelancer on various projects.

Alistair McMaster is currently employed as a Software Engineer and Quantitative Strategist at a major financial services firm. He graduated from the University of Cambridge in 2016 with a B.A. (Hons) in Natural Sciences specializing in Astrophysics. His broader career interests include applications of data science to relationship networks and supporting social causes. Alistair is an active contributor to pandas and a strong advocate of open-source software.
In his spare time, he enjoys distance running, cycling, rock climbing, and walks with his family and friends on weekends.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface

Section 1: Foundation for Data Analysis

Chapter 1: Getting Started with Python Libraries
  Understanding data analysis
  The standard process of data analysis
  The KDD process
  SEMMA
  CRISP-DM
  Comparing data analysis and data science
  The roles of data analysts and data scientists
  The skillsets of data analysts and data scientists
  Installing Python 3
  Python installation and setup on Windows
  Python installation and setup on Linux
  Python installation and setup on Mac OS X with a GUI installer
  Python installation and setup on Mac OS X with brew
  Software used in this book
  Using IPython as a shell
  Reading manual pages
  Where to find help and references to Python data analysis libraries
  Using JupyterLab
  Using Jupyter Notebooks
  Advanced features of Jupyter Notebooks
  Keyboard shortcuts
  Installing other kernels
  Running shell commands
  Extensions for Notebook
  Summary

Chapter 2: NumPy and pandas
  Technical requirements
  Understanding NumPy arrays
  Array features
  Selecting array elements
  NumPy array numerical data types
  dtype objects
  Data type character codes
  dtype constructors
  dtype attributes
  Manipulating array shapes
  The stacking of NumPy arrays
  Partitioning NumPy arrays
  Changing the data type of NumPy arrays
  Creating NumPy views and copies
  Slicing NumPy arrays
  Boolean and fancy indexing
  Broadcasting arrays
  Creating pandas DataFrames
  Understanding pandas Series
  Reading and querying the Quandl data
  Describing pandas DataFrames
  Grouping and joining pandas DataFrame
  Working with missing values
  Creating pivot tables
  Dealing with dates
  Summary
  References

Chapter 3: Statistics
  Technical requirements
  Understanding attributes and their types
  Types of attributes
  Discrete and continuous attributes
  Measuring central tendency
  Mean
  Mode
  Median
  Measuring dispersion
  Skewness and kurtosis
  Understanding relationships using covariance and correlation coefficients
  Pearson's correlation coefficient
  Spearman's rank correlation coefficient
  Kendall's rank correlation coefficient
  Central limit theorem
  Collecting samples
  Performing parametric tests
  Performing non-parametric tests
  Summary

Chapter 4: Linear Algebra
  Technical requirements
  Fitting to polynomials with NumPy
  Determinant
  Finding the rank of a matrix
  Matrix inverse using NumPy
  Solving linear equations using NumPy
  Decomposing a matrix using SVD
  Eigenvectors and Eigenvalues using NumPy
  Generating random numbers
  Binomial distribution
  Normal distribution
  Testing normality of data using SciPy
  Creating a masked array using the numpy.ma subpackage
  Summary

Section 2: Exploratory Data Analysis and Data Cleaning

Chapter 5: Data Visualization
  Technical requirements
  Visualization using Matplotlib
  Accessories for charts
  Scatter plot
  Line plot
  Pie plot
  Bar plot
  Histogram plot
  Bubble plot
  pandas plotting
  Advanced visualization using the Seaborn package
  lm plots
  Bar plots
  Distribution plots
  Box plots
  KDE plots
  Violin plots
  Count plots
  Joint plots
  Heatmaps
  Pair plots
  Interactive visualization with Bokeh
  Plotting a simple graph
  Glyphs
  Layouts
  Nested layout using row and column layouts
  Multiple plots
  Interactions
  Hide click policy
  Mute click policy
  Annotations
  Hover tool
  Widgets
  Tab panel
  Slider
  Summary

Chapter 6: Retrieving, Processing, and Storing Data
  Technical requirements
  Reading and writing CSV files with NumPy
  Reading and writing CSV files with pandas
  Reading and writing data from Excel
  Reading and writing data from JSON
  Reading and writing data from HDF5
  Reading and writing data from HTML tables
  Reading and writing data from Parquet
  Reading and writing data from a pickle pandas object
  Lightweight access with sqlite3
  Reading and writing data from MySQL
  Inserting a whole DataFrame into the database
  Reading and writing data from MongoDB
  Reading and writing data from Cassandra
  Reading and writing data from Redis
  PonyORM
  Summary

Chapter 7: Cleaning Messy Data
  Technical requirements
  Exploring data
  Filtering data to weed out the noise
  Column-wise filtration
  Row-wise filtration
  Handling missing values
  Dropping missing values
  Filling in a missing value
  Handling outliers
  Feature encoding techniques
  One-hot encoding
  Label encoding
  Ordinal encoder
