ebook img

Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter PDF

582 Pages·2022·8.95 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter

T E hi di r tid o n P ython for Data Analysis Data Wrangling with pandas, NumPy & Jupyter powered by Wes McKinney Python for Data Analysis Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for “With this new edition, Python 3.10 and pandas 1.4, the third edition of this hands- Wes has updated his on guide is packed with practical case studies that show you book to ensure it remains how to solve a broad set of data analysis problems effectively. the go-to resource for You’ll learn the latest versions of pandas, NumPy, and Jupyter all things related to data in the process. analysis with Python Written by Wes McKinney, the creator of the Python pandas and pandas. I cannot project, this book is a practical, modern introduction to recommend this book data science tools in Python. It’s ideal for analysts new to Python and for Python programmers new to data science highly enough.” and scientific computing. Data files and related material are —Paul Barry available on GitHub. Lecturer and author of O’Reilly’s Head First Python • Use the Jupyter notebook and the IPython shell for exploratory computing Wes McKinney, cofounder and chief • Learn basic and advanced features in NumPy technology officer of Voltron Data, is • Get started with data analysis tools in the pandas library an active member of the Python data community and an advocate for Python • Use flexible tools to load, clean, transform, merge, and use in data analysis, finance, and reshape data statistical computing applications. A • Create informative visualizations with matplotlib graduate of MIT, he’s also a member of the project management committees • Apply the pandas groupBy facility to slice, dice, and for the Apache Software Foundation’s summarize datasets Apache Arrow and Apache Parquet projects. • Analyze and manipulate regular and irregular time series data • Learn how to solve real-world data analysis problems with thorough, detailed examples DATA Twitter: @oreillymedia linkedin.com/company/oreilly-media US $69.99 CAN $87.99 youtube.com/oreillymedia ISBN: 978-1-098-10403-0 56999 9 781098 104030 THIRD EDITION Python for Data Analysis Data Wrangling with pandas, NumPy, and Jupyter Wes McKinney BBeeiijjiinngg BBoossttoonn FFaarrnnhhaamm SSeebbaassttooppooll TTookkyyoo Python for Data Analysis by Wes McKinney Copyright © 2022 Wesley McKinney. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Acquisitions Editor: Jessica Haberman Indexer: Sue Klefstad Development Editor: Angela Rufino Interior Designer: David Futato Production Editor: Christopher Faucher Cover Designer: Karen Montgomery Copyeditor: Sonia Saruba Illustrator: Kate Dullea Proofreader: Piper Editorial Consulting, LLC October 2012: First Edition October 2017: Second Edition August 2022: Third Edition Revision History for the Third Edition 2022-08-12: First Release See https://www.oreilly.com/catalog/errata.csp?isbn=0636920519829 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Python for Data Analysis, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-098-10403-0 [LSI] Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi 1. Preliminaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 What Is This Book About? 1 What Kinds of Data? 1 1.2 Why Python for Data Analysis? 2 Python as Glue 3 Solving the “Two-Language” Problem 3 Why Not Python? 3 1.3 Essential Python Libraries 4 NumPy 4 pandas 5 matplotlib 6 IPython and Jupyter 6 SciPy 7 scikit-learn 8 statsmodels 8 Other Packages 9 1.4 Installation and Setup 9 Miniconda on Windows 9 GNU/Linux 10 Miniconda on macOS 11 Installing Necessary Packages 11 Integrated Development Environments and Text Editors 12 1.5 Community and Conferences 13 1.6 Navigating This Book 14 Code Examples 15 iii Data for Examples 15 Import Conventions 16 2. Python Language Basics, IPython, and Jupyter Notebooks. . . . . . . . . . . . . . . . . . . . . . . . 17 2.1 The Python Interpreter 18 2.2 IPython Basics 19 Running the IPython Shell 19 Running the Jupyter Notebook 20 Tab Completion 23 Introspection 25 2.3 Python Language Basics 26 Language Semantics 26 Scalar Types 34 Control Flow 42 2.4 Conclusion 45 3. Built-In Data Structures, Functions, and Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.1 Data Structures and Sequences 47 Tuple 47 List 51 Dictionary 55 Set 59 Built-In Sequence Functions 62 List, Set, and Dictionary Comprehensions 63 3.2 Functions 65 Namespaces, Scope, and Local Functions 67 Returning Multiple Values 68 Functions Are Objects 69 Anonymous (Lambda) Functions 70 Generators 71 Errors and Exception Handling 74 3.3 Files and the Operating System 76 Bytes and Unicode with Files 80 3.4 Conclusion 82 4. NumPy Basics: Arrays and Vectorized Computation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.1 The NumPy ndarray: A Multidimensional Array Object 85 Creating ndarrays 86 Data Types for ndarrays 88 Arithmetic with NumPy Arrays 91 Basic Indexing and Slicing 92 iv | Table of Contents Boolean Indexing 97 Fancy Indexing 100 Transposing Arrays and Swapping Axes 102 4.2 Pseudorandom Number Generation 103 4.3 Universal Functions: Fast Element-Wise Array Functions 105 4.4 Array-Oriented Programming with Arrays 108 Expressing Conditional Logic as Array Operations 110 Mathematical and Statistical Methods 111 Methods for Boolean Arrays 113 Sorting 114 Unique and Other Set Logic 115 4.5 File Input and Output with Arrays 116 4.6 Linear Algebra 116 4.7 Example: Random Walks 118 Simulating Many Random Walks at Once 120 4.8 Conclusion 121 5. Getting Started with pandas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.1 Introduction to pandas Data Structures 124 Series 124 DataFrame 129 Index Objects 136 5.2 Essential Functionality 138 Reindexing 138 Dropping Entries from an Axis 141 Indexing, Selection, and Filtering 142 Arithmetic and Data Alignment 152 Function Application and Mapping 158 Sorting and Ranking 160 Axis Indexes with Duplicate Labels 164 5.3 Summarizing and Computing Descriptive Statistics 165 Correlation and Covariance 168 Unique Values, Value Counts, and Membership 170 5.4 Conclusion 173 6. Data Loading, Storage, and File Formats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.1 Reading and Writing Data in Text Format 175 Reading Text Files in Pieces 182 Writing Data to Text Format 184 Working with Other Delimited Formats 185 JSON Data 187 Table of Contents | v XML and HTML: Web Scraping 189 6.2 Binary Data Formats 193 Reading Microsoft Excel Files 194 Using HDF5 Format 195 6.3 Interacting with Web APIs 197 6.4 Interacting with Databases 199 6.5 Conclusion 201 7. Data Cleaning and Preparation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 7.1 Handling Missing Data 203 Filtering Out Missing Data 205 Filling In Missing Data 207 7.2 Data Transformation 209 Removing Duplicates 209 Transforming Data Using a Function or Mapping 211 Replacing Values 212 Renaming Axis Indexes 214 Discretization and Binning 215 Detecting and Filtering Outliers 217 Permutation and Random Sampling 219 Computing Indicator/Dummy Variables 221 7.3 Extension Data Types 224 7.4 String Manipulation 227 Python Built-In String Object Methods 227 Regular Expressions 229 String Functions in pandas 232 7.5 Categorical Data 235 Background and Motivation 236 Categorical Extension Type in pandas 237 Computations with Categoricals 240 Categorical Methods 242 7.6 Conclusion 245 8. Data Wrangling: Join, Combine, and Reshape. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 8.1 Hierarchical Indexing 247 Reordering and Sorting Levels 250 Summary Statistics by Level 251 Indexing with a DataFrame’s columns 252 8.2 Combining and Merging Datasets 253 Database-Style DataFrame Joins 254 Merging on Index 259 vi | Table of Contents Concatenating Along an Axis 263 Combining Data with Overlap 268 8.3 Reshaping and Pivoting 270 Reshaping with Hierarchical Indexing 270 Pivoting “Long” to “Wide” Format 273 Pivoting “Wide” to “Long” Format 277 8.4 Conclusion 279 9. Plotting and Visualization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 9.1 A Brief matplotlib API Primer 282 Figures and Subplots 283 Colors, Markers, and Line Styles 288 Ticks, Labels, and Legends 290 Annotations and Drawing on a Subplot 294 Saving Plots to File 296 matplotlib Configuration 297 9.2 Plotting with pandas and seaborn 298 Line Plots 298 Bar Plots 301 Histograms and Density Plots 309 Scatter or Point Plots 311 Facet Grids and Categorical Data 314 9.3 Other Python Visualization Tools 317 9.4 Conclusion 317 10. Data Aggregation and Group Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319 10.1 How to Think About Group Operations 320 Iterating over Groups 324 Selecting a Column or Subset of Columns 326 Grouping with Dictionaries and Series 327 Grouping with Functions 328 Grouping by Index Levels 328 10.2 Data Aggregation 329 Column-Wise and Multiple Function Application 331 Returning Aggregated Data Without Row Indexes 335 10.3 Apply: General split-apply-combine 335 Suppressing the Group Keys 338 Quantile and Bucket Analysis 338 Example: Filling Missing Values with Group-Specific Values 340 Example: Random Sampling and Permutation 343 Example: Group Weighted Average and Correlation 344 Table of Contents | vii Example: Group-Wise Linear Regression 347 10.4 Group Transforms and “Unwrapped” GroupBys 347 10.5 Pivot Tables and Cross-Tabulation 351 Cross-Tabulations: Crosstab 354 10.6 Conclusion 355 11. Time Series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357 11.1 Date and Time Data Types and Tools 358 Converting Between String and Datetime 359 11.2 Time Series Basics 361 Indexing, Selection, Subsetting 363 Time Series with Duplicate Indices 365 11.3 Date Ranges, Frequencies, and Shifting 366 Generating Date Ranges 367 Frequencies and Date Offsets 370 Shifting (Leading and Lagging) Data 371 11.4 Time Zone Handling 374 Time Zone Localization and Conversion 375 Operations with Time Zone-Aware Timestamp Objects 377 Operations Between Different Time Zones 378 11.5 Periods and Period Arithmetic 379 Period Frequency Conversion 380 Quarterly Period Frequencies 382 Converting Timestamps to Periods (and Back) 384 Creating a PeriodIndex from Arrays 385 11.6 Resampling and Frequency Conversion 387 Downsampling 388 Upsampling and Interpolation 391 Resampling with Periods 392 Grouped Time Resampling 394 11.7 Moving Window Functions 396 Exponentially Weighted Functions 399 Binary Moving Window Functions 401 User-Defined Moving Window Functions 402 11.8 Conclusion 403 12. Introduction to Modeling Libraries in Python. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405 12.1 Interfacing Between pandas and Model Code 405 12.2 Creating Model Descriptions with Patsy 408 Data Transformations in Patsy Formulas 410 Categorical Data and Patsy 412 viii | Table of Contents

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.