
Thoughtful Data Science PDF

491 Pages·2018·12.69 MB·English

Preview Thoughtful Data Science

Thoughtful Data Science

A Programmer's Toolset for Data Analysis and Artificial Intelligence with Python, Jupyter Notebook, and PixieDust

David Taieb

BIRMINGHAM - MUMBAI

Thoughtful Data Science

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Acquisition Editors: Frank Pohlmann, Suresh M Jain
Project Editors: Savvy Sequeira, Kishor Rit
Content Development Editor: Alex Sorrentino
Technical Editor: Bhagyashree Rai
Proofreader: Safis Editing
Indexers: Priyanka Dhadke
Graphics: Tom Scaria
Production Coordinator: Sandip Tadge

First published: July 2018
Production reference: 1300718

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78883-996-9

www.packtpub.com

To Alexandra, Solomon, Zachary, Victoria and Charlotte: Thank you for your support, unbounded love, and infinite patience. I would not have been able to complete this work without all of you.

To Fernand and Gisele: Without whom I wouldn't be where I am today. Thank you for your continued guidance all these years.

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

• Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
• Learn better with Skill Plans built especially for you
• Get a free eBook or video every month
• Mapt is fully searchable
• Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

David Taieb is the Distinguished Engineer for the Watson and Cloud Platform Developer Advocacy team at IBM, leading a team of avid technologists on a mission to educate developers on the art of the possible with data science, AI and cloud technologies. He's passionate about building open source tools, such as the PixieDust Python Library for Jupyter Notebooks, which help improve developer productivity and democratize data science. David enjoys sharing his experience by speaking at conferences and meetups, where he likes to meet as many people as possible.

I want to give special thanks to all of the following dear friends at IBM who contributed to the development of PixieDust and/or provided invaluable support during the writing of this book: Brad Noble, Jose Barbosa, Mark Watson, Raj Singh, Mike Broberg, Jessica Mantaro, Margriet Groenendijk, Patrick Titzler, Glynn Bird, Teri Chadbourne, Bradley Holt, Adam Cox, Jamie Jennings, Terry Antony, Stephen Badolato, Terri Gerber, Peter May, Brady Paterson, Kathleen Francis, Dan O'Connor, Muhtar (Burak) Akbulut, Navneet Rao, Panos Karagiannis, Allen Dean, and Jim Young.

About the reviewers

Margriet Groenendijk is a data scientist and developer advocate for IBM. She has a background in climate research, where, at the University of Exeter, she explored large observational datasets and the output of global scale weather and climate models to understand the impact of land use on climate. Prior to that, she explored the effect of climate on the uptake of carbon from the atmosphere by forests during her PhD research at the Vrije Universiteit in Amsterdam. Nowadays, she explores ways to simplify working with diverse data using open source tools, IBM Cloud, and Watson Studio. She has experience with cloud services, databases, and APIs to access, combine, clean, and store different types of data. Margriet uses time series analysis, statistical data analysis, modeling and parameter optimisation, machine learning, and complex data visualization. She writes blogs and speaks about these topics at conferences and meetups.

va barbosa is a developer advocate for the Center for Open-Source Data & AI Technologies, where he helps developers discover and make use of data and machine learning technologies. This is fueled by his passion to help others, and guided by his enthusiasm for open source technology. Always looking to embrace new challenges and fulfill his appetite for learning, va immerses himself in a wide range of technologies and activities. He has been an electronic technician, support engineer, software engineer, and developer advocate. When not focusing on the developer experience, va enjoys dabbling in photography. If you can't find him in front of a computer, try looking behind a camera.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface

Chapter 1: Perspectives on Data Science from a Developer
    What is data science
    Is data science here to stay?
    Why is data science on the rise?
    What does that have to do with developers?
    Putting these concepts into practice
    Deep diving into a concrete example
    Data pipeline blueprint
    What kind of skills are required to become a data scientist?
    IBM Watson DeepQA
    Back to our sentiment analysis of Twitter hashtags project
    Lessons learned from building our first enterprise-ready data pipeline
    Data science strategy
    Jupyter Notebooks at the center of our strategy
    Why are Notebooks so popular?
    Summary

Chapter 2: Data Science at Scale with Jupyter Notebooks and PixieDust
    Why choose Python?
    Introducing PixieDust
    SampleData – a simple API for loading data
    Wrangling data with pixiedust_rosie
    Display – a simple interactive API for data visualization
    Filtering
    Bridging the gap between developers and data scientists with PixieApps
    Architecture for operationalizing data science analytics
    Summary

Chapter 3: PixieApp under the Hood
    Anatomy of a PixieApp
    Routes
    Generating requests to routes
    A GitHub project tracking sample application
    Displaying the search results in a table
    Invoking the PixieDust display() API using pd_entity attribute
    Invoking arbitrary Python code with pd_script
    Making the application more responsive with pd_refresh
    Creating reusable widgets
    Summary

Chapter 4: Deploying PixieApps to the Web with the PixieGateway Server
    Overview of Kubernetes
    Installing and configuring the PixieGateway server
    PixieGateway server configuration
    PixieGateway architecture
    Publishing an application
    Encoding state in the PixieApp URL
    Sharing charts by publishing them as web pages
    PixieGateway admin console
    Python Console
    Displaying warmup and run code for a PixieApp
    Summary

Chapter 5: Best Practices and Advanced PixieDust Concepts
    Use @captureOutput decorator to integrate the output of third-party Python libraries
    Create a word cloud image with @captureOutput
    Increase modularity and code reuse
    Creating a widget with pd_widget
    PixieDust support of streaming data
    Adding streaming capabilities to your PixieApp
    Adding dashboard drill-downs with PixieApp events
    Extending PixieDust visualizations
    Debugging
    Debugging on the Jupyter Notebook using pdb
    Visual debugging with PixieDebugger
    Debugging PixieApp routes with PixieDebugger
    Troubleshooting issues using PixieDust logging
    Client-side debugging
    Run Node.js inside a Python Notebook
    Summary

Chapter 6: Image Recognition with TensorFlow
    What is machine learning?
    What is deep learning?
    Getting started with TensorFlow
    Simple classification with DNNClassifier
    Image recognition sample application
    Part 1 – Load the pretrained MobileNet model
    Part 2 – Create a PixieApp for our image recognition sample application
    Part 3 – Integrate the TensorBoard graph visualization
    Part 4 – Retrain the model with custom training data
    Summary

Chapter 7: Big Data Twitter Sentiment Analysis
    Getting started with Apache Spark
    Apache Spark architecture
    Configuring Notebooks to work with Spark
    Twitter sentiment analysis application
    Part 1 – Acquiring the data with Spark Structured Streaming
    Architecture diagram for the data pipeline
    Authentication with Twitter
    Creating the Twitter stream
    Creating a Spark Streaming DataFrame
    Creating and running a structured query
    Monitoring active streaming queries
    Creating a batch DataFrame from the Parquet files
    Part 2 – Enriching the data with sentiment and most relevant extracted entity
    Getting started with the IBM Watson Natural Language Understanding service
    Part 3 – Creating a real-time dashboard PixieApp
    Refactoring the analytics into their own methods
    Creating the PixieApp
    Part 4 – Adding scalability with Apache Kafka and IBM Streams Designer
    Streaming the raw tweets to Kafka
    Enriching the tweets data with the Streaming Analytics service
    Creating a Spark Streaming DataFrame with a Kafka input source
    Summary

Chapter 8: Financial Time Series Analysis and Forecasting
    Getting started with NumPy
    Creating a NumPy array
    Operations on ndarray
    Selections on NumPy arrays
    Broadcasting

Description:
Thoughtful Data Science brings new strategies and a carefully crafted programmer's toolset to work with modern, cutting-edge data analysis. This new approach is designed specifically to give developers more efficiency and power to create cutting-edge data analysis and artificial intelligence insight
