The Pandas Workshop: A comprehensive guide to using Python for data analysis with real-world case studies

744 Pages·2022·28.935 MB·English

Preview The Pandas Workshop: A comprehensive guide to using Python for data analysis with real-world case studies

The Pandas Workshop
A comprehensive guide to using Python for data analysis with real-world case studies

Blaine Bateman
Saikat Basak
Thomas V. Joseph
William So

BIRMINGHAM – MUMBAI

The Pandas Workshop

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Heramb Bhavsar
Senior Editor: David Sugarman
Content Development Editor: Joseph Sunil
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Manju Arasan
Production Designer: Ponraj Dhandapani
Marketing Coordinator: Nivedita Singh

First published: June 2022
Production reference: 1270522

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-80020-893-3
www.packt.com

To my wife Cynthia, who steadfastly supports me in these efforts and is a constant source of inspiration.
-Blaine Bateman

“To all my friends, who couldn’t believe I wrote a book about panda(s).”
-William So

To my mother Marykutty, and to the memory of my father V. T. Joseph, for laying the foundation of what I am. To my wife Anu, for being the pillar of support in all my endeavors. My children Joe and Tess, for reminding me that life is not all about Data Science.
-Thomas V. Joseph

Contributors

About the authors

Blaine Bateman has more than 35 years of experience working with various industries, from government R&D to start-ups to $1 billion public companies. His experience focuses on analytics, including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/TensorFlow, and AWS and Azure machine learning services. As a machine learning consultant, he has developed and deployed actual machine learning models in industry.

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas on the frontiers of science and technology.

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for Master of Data Science and Innovation at the University of Technology Sydney. During his career, he successfully covered the end-to-end spectrum of data analytics from ML to Business Intelligence, helping stakeholders derive valuable insights and achieve amazing results that benefit the business. William So is a co-author of "The Applied Artificial Intelligence Workshop", published by Packt.

About the reviewer

Vishwesh Ravi Shrimali graduated from BITS Pilani, where he studied mechanical engineering, in 2018.
He also completed a master's in machine learning and AI at LJMU in 2021. He authored Machine Learning for OpenCV (2nd edition), Computer Vision Workshop, and Data Science for Marketing Analytics (2nd edition), all available from Packt. When he is not writing blogs or working on projects, he likes to go on long walks or play his acoustic guitar.

Table of Contents

Preface

Part 1 – Introduction to pandas

1. Introduction to pandas
  Introduction to the world of pandas
  Exploring the history and evolution of pandas
  Components and applications of pandas
  Understanding the basic concepts of pandas
  The Series object
  The DataFrame object
  Working with local files
  Reading a CSV file
  Displaying a snapshot of the data
  Writing data to a file
  Data types in pandas
  Data selection
  Data transformation
  Data visualization
  Time series data
  Code optimization
  Utility functions
  Exercise 1.02 – basic numerical operations with pandas
  Data modeling
  Exercise 1.03 – comparing data from two DataFrames
  Activity 1.01 – comparing sales data for two stores
  Summary

2. Working with Data Structures
  Introduction to data structures
  The need for data structures
  Data structures
  Creating DataFrames in pandas
  Exercise 2.01 – Creating a DataFrame
  Indexes and columns
  Exercise 2.02 – Reading DataFrames and manipulating the index
  Working with columns
  Series
  The Series index
  Exercise 2.03 – Series to DataFrames
  Using time as the index
  Exercise 2.04 – DataFrame indices
  Activity 2.01 – Working with pandas data structures
  Summary

3. Data I/O
  The world of data
  Exploring data sources
  Text files and binary files
  Online data sources
  Exercise 3.01 – reading data from web pages
  Fundamental formats
  Text data
  Exercise 3.02 – text character encoding and data separators
  Binary data
  Databases – SQL data
  sqlite3
  Additional text formats
  Working with JSON
  Working with HTML/XML
  Working with XML data
  Working with Excel
  SAS data
  SPSS data
  Stata data
  HDF5 data
  Manipulating SQL data
  Exercise 3.03 – working with SQL
  Choosing a format for a project
  Activity 3.01 – using SQL data for pandas analytics
  Summary

4. Pandas Data Types
  Introducing pandas dtypes
  Obtaining the underlying data types
  Converting from one type into another
  Exercise 4.01 – underlying data types and conversion
  Missing data types
  The missing alphabet soup
  Nullable types
  Exercise 4.02 – missing data and converting into non-nullable dtypes
  Activity 4.01 – optimizing memory usage by converting into the appropriate dtypes
  Subsetting by data types
  Working with the dtype category
  Working with dtype = datetime64[ns]
  Working with dtype = timedelta64[ns]
  Exercise 4.03 – working with text data using string methods
  Selecting data in a DataFrame by its dtype
  Summary

Part 2 – Working with Data

5. Data Selection – DataFrames
  Introduction to DataFrames
  The need for data selection methods
  Data selection in pandas DataFrames
  The index and its forms
  Exercise 5.01 – identifying the row and column indices in a dataset
  Slicing and indexing methods
  Exercise 5.02 – subsetting rows and columns
  Using labels as the index and the pandas multi-index
  Creating a multi-index from columns
  Activity 5.01 – Creating a multi-index from columns
  Bracket and dot notation
  Bracket notation
  Dot notation
  Exercise 5.03 – integer row numbers versus labels
  Using extended indexing
  Type exceptions
  Changing DataFrame values using bracket or dot notation
  Exercise 5.04 – selecting data using bracket and dot notation
  Summary

6. Data Selection – Series
  Introduction to pandas Series
  The Series index
  Data selection in a pandas Series
  Brackets, dots, Series.loc, and Series.iloc
  Exercise 6.01 – basic Series data selection
  Preparing Series from DataFrames and vice versa
  Exercise 6.02 – using a Series index to select values
  Activity 6.1 – Series data selection
