Pandas Cookbook
Recipes for Scientific Computing, Time Series Analysis and Data Visualization using Python
Theodore Petrou

Pandas is one of the most powerful, flexible, and efficient scientific computing
packages in Python. With this book, you will explore data in pandas through
dozens of practice problems with detailed solutions in iPython notebooks.

This book will provide you with clean, clear recipes, and solutions that explain how to
handle common data manipulation and scientific computing tasks with pandas.
You will work with different types of datasets, and perform data manipulation
and data wrangling effectively. You will explore the power of pandas DataFrames
and find out about boolean and multi-indexing. Tasks related to statistical
and time series computations, and how to implement them in financial and
scientific applications are also covered in this book.

By the end of this book, you will have all the knowledge you need to master
pandas, and perform fast and accurate scientific computing.

Things you will learn:
• Master the fundamentals of pandas to quickly begin exploring any dataset
• Isolate any subset of data by properly selecting and querying the data
• Split data into independent groups before applying aggregations and transformations to each group
• Restructure data into a tidy form to make data analysis and visualization easier
• Prepare messy real-world datasets for machine learning
• Combine and merge data from different sources through pandas SQL-like operations
• Utilize pandas unparalleled time series functionality
• Create beautiful and insightful visualizations through pandas direct hooks to matplotlib and seaborn

www.packtpub.com
Pandas Cookbook
Recipes for Scientific Computing, Time Series Analysis and
Data Visualization using Python
Theodore Petrou
BIRMINGHAM - MUMBAI
Pandas Cookbook
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2017
Production reference: 1181017
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78439-387-8
www.packtpub.com
Credits
Author: Theodore Petrou
Reviewers: Sonali Dayal, Kuntal Ganguly, Shilpi Saxena
Commissioning Editor: Veena Pagare
Acquisition Editor: Tushar Gupta
Content Development Editor: Snehal Kolte
Technical Editor: Sayli Nikalje
Copy Editor: Tasneem Fatehi
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Dutta
Production Coordinator: Deepika Naik
About the Author
Theodore Petrou is a data scientist and the founder of Dunder Data, a professional
educational company focusing on exploratory data analysis. He is also the head of Houston
Data Science, a meetup group with more than 2,000 members that has the primary goal of
getting local data enthusiasts together in the same room to practice data science. Before
founding Dunder Data, Ted was a data scientist at Schlumberger, a large oil services
company, where he spent the vast majority of his time exploring data.
Some of his projects included using targeted sentiment analysis to discover the root cause of
part failure from engineer text, developing customized client/server dashboarding
applications, and real-time web services to avoid the mispricing of sales items. Ted received
his master's degree in statistics from Rice University and used his analytical skills to play
poker professionally and teach math before becoming a data scientist. Ted is a strong
supporter of learning through practice and can often be found answering questions about
pandas on Stack Overflow.
Acknowledgements
I would first like to thank my wife, Eleni, and two young children, Penelope and Niko, who
endured extended periods of time without me as I wrote.
I’d also like to thank Sonali Dayal, whose constant feedback helped immensely in
structuring the content of the book to improve its effectiveness. Thank you to Roy Keyes,
who is the most exceptional data scientist I know and whose collaboration made Houston
Data Science possible. Thank you to Scott Boston, an extremely skilled pandas user, for
developing ideas for recipes. Thank you very much to Kim Williams, Randolph Adami,
Kevin Higgins, and Vishwanath Avasarala, who took a chance on me during my
professional career when I had little to no experience. Thanks to my coworker at
Schlumberger, Micah Miller, for his critical, honest, and instructive feedback on anything
that we developed together and his constant pursuit to move toward Python.
Thank you to Phu Ngo, who critically challenges and sharpens my thinking more than
anyone. Thank you to my brother, Dean Petrou, for being right by my side as we developed
our analytical skills through poker and again through business. Thank you to my sister,
Stephanie Burton, for always knowing what I’m thinking and making sure that I’m aware of
it. Thank you to my mother, Sofia Petrou, for her ceaseless love, support, and endless math
puzzles that challenged me as a child. And thank you to my father, Steve Petrou, who,
although no longer here, remains close to my heart and continues to encourage me every
day.
About the Reviewers
Sonali Dayal is a master's candidate in biostatistics at the University of California, Berkeley.
Previously, she has worked as a freelance software and data science engineer for early-stage
start-ups, where she built supervised and unsupervised machine learning models as well as
data pipelines and interactive data analytics dashboards. She received her bachelor of
science (B.S.) in biochemistry from Virginia Tech in 2011.
Kuntal Ganguly is a big data machine learning engineer focused on building large-scale
data-driven systems using big data frameworks and machine learning. He has around 7
years of experience building several big data and machine learning applications.
Kuntal provides solutions to AWS customers in building real-time analytics systems using
managed cloud services and open source Hadoop ecosystem technologies such as Spark,
Kafka, Storm, Solr, and so on, along with machine learning and deep learning frameworks
such as scikit-learn, TensorFlow, Keras, and BigDL. He enjoys hands-on software
development, and has single-handedly conceived, architected, developed, and deployed
several large-scale distributed applications. He is a machine learning and deep learning
practitioner and very passionate about building intelligent applications.
Kuntal is the author of the books: Learning Generative Adversarial Network and R Data
Analysis Cookbook - Second Edition, Packt Publishing.
Shilpi Saxena is a seasoned professional who leads in management with an edge of being a
technology evangelist. She is an engineer who has exposure to a variety of domains
(machine-to-machine space, healthcare, telecom, hiring, and manufacturing). She has
experience in all aspects of the conception and execution of enterprise solutions. She has
been architecting, managing, and delivering solutions in the big data space for the last 3
years, handling high-performance, geographically distributed teams of elite engineers. Shilpi
has more than 12 years of experience (3 years in the big data space) in the development and
execution of various facets of enterprise solutions, in both the product and services dimensions
of the software industry. An engineer by degree and profession, she has worn various hats
(developer, technical leader, product owner, and tech manager) and has seen all the flavors that
the industry has to offer. She has architected and worked through some of the pioneer
production implementations in big data on Storm and Impala with auto scaling in AWS.
LinkedIn: http://in.linkedin.com/pub/shilpi-saxena/4/552/a30
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com. Did
you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
service@packtpub.com for more details. At www.PacktPub.com, you can also read a
collection of free technical articles, sign up for a range of free newsletters and receive
exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt
gives you full access to all Packt books and video courses, as well as industry-leading tools
to help you plan your personal development and advance your career.
Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/1784393878. If you'd like to join our team of regular
reviewers, you can email us at customerreviews@packtpub.com. We award our regular
reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be
relentless in improving our products!
Table of Contents
Preface 1
Chapter 1: Pandas Foundations 15
Introduction 15
Dissecting the anatomy of a DataFrame 16
Getting ready 16
How to do it... 16
How it works... 17
There's more... 18
See also 18
Accessing the main DataFrame components 18
Getting ready 18
How to do it... 19
How it works... 20
There's more... 21
See also 21
Understanding data types 22
Getting ready 23
How to do it... 23
How it works... 23
There's more... 24
See also 24
Selecting a single column of data as a Series 24
Getting ready 24
How to do it... 25
How it works... 25
There's more... 26
See also 27
Calling Series methods 27
Getting ready 28
How to do it... 28
How it works... 32
There's more... 33
See also 34
Working with operators on a Series 34