Python Machine Learning Blueprints
Second Edition

Put your machine learning concepts to the test by developing real-world smart projects

Alexander Combs
Michael Roman

BIRMINGHAM - MUMBAI

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Varsha Shetty
Content Development Editor: Snehal Kolte
Technical Editor: Naveen Sharma
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jisha Chirayil
Production Coordinator: Arvindkumar Gupta

First published: July 2016
Second edition: January 2019
Production reference: 1310119

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78899-417-0

www.packtpub.com

Contributors

About the authors

Alexander Combs is an experienced data scientist, strategist, and developer with a background in financial data extraction, natural language processing and generation, and quantitative and statistical modeling. He currently lives and works in New York City.

Writing a book is truly a massive undertaking that would not be possible without the support of others. I would like to thank my family for their love and encouragement and Jocelyn for her patience and understanding. I owe all of you tremendously.

Michael Roman is a data scientist at The Atlantic, where he designs, tests, analyzes, and productionizes machine learning models to address a range of business topics. Prior to this he was an associate instructor at a full-time data science immersive program in New York City. His interests include computer vision, propensity modeling, natural language processing, and entrepreneurship.

About the reviewer

Saurabh Chhajed is a machine learning and big data engineer with 9 years of professional experience in the enterprise application development life cycle using the latest frameworks, tools, and design patterns.
He has experience of designing and implementing some of the most widely used and scalable customer-facing recommendation systems with extensive usage of the big data ecosystem: the batch, real-time, and machine learning pipeline. He has also worked for some of the largest investment banks, credit card companies, and manufacturing companies around the world, implementing a range of robust and scalable product suites.

Table of Contents

Preface 1

Chapter 1: The Python Machine Learning Ecosystem 6
  Data science/machine learning workflow 7
  Acquisition 8
  Inspection 8
  Preparation 9
  Modeling 9
  Evaluation 9
  Deployment 10
  Python libraries and functions for each stage of the data science workflow 10
  Acquisition 10
  Inspection 11
  The Jupyter Notebook 11
  Pandas 13
  Visualization 21
  The matplotlib library 23
  The seaborn library 29
  Preparation 32
  map 33
  apply 34
  applymap 35
  groupby 36
  Modeling and evaluation 38
  Statsmodels 38
  Scikit-learn 42
  Deployment 47
  Setting up your machine learning environment 48
  Summary 48

Chapter 2: Build an App to Find Underpriced Apartments 49
  Sourcing apartment listing data 50
  Pulling down listing data 50
  Pulling out the individual data points 57
  Parsing data 61
  Inspecting and preparing the data 64
  Sneak-peek at the data types 66
  Visualizing our data 69
  Visualizing the data 77
  Modeling the data 79
  Forecasting 82
  Extending the model 86
  Summary 86

Chapter 3: Build an App to Find Cheap Airfares 87
  Sourcing airfare pricing data 88
  Retrieving fare data with advanced web scraping 89
  Creating a link 91
  Parsing the DOM to extract pricing data 93
  Parsing 96
  Identifying outlier fares with anomaly detection techniques 106
  Sending real-time alerts using IFTTT 113
  Putting it all together 118
  Summary 121

Chapter 4: Forecast the IPO Market Using Logistic Regression 122
  The IPO market 123
  What is an IPO? 123
  Recent IPO market performance 124
  Working on the DataFrame 126
  Analyzing the data 131
  Summarizing the performance of the stocks 132
  Baseline IPO strategy 136
  Data cleansing and feature engineering 139
  Adding features to influence the performance of an IPO 139
  Binary classification with logistic regression 141
  Creating the target for our model 143
  Dummy coding 143
  Examining the model performance 145
  Generating the importance of a feature from our model 149
  Random forest classifier method 151
  Summary 153

Chapter 5: Create a Custom Newsfeed 154
  Creating a supervised training set with Pocket 155
  Installing the Pocket Chrome Extension 155
  Using the Pocket API to retrieve stories 157
  Using the Embedly API to download story bodies 164
  Basics of Natural Language Processing 166
  Support Vector Machines 169
  IFTTT integration with feeds, Google Sheets, and email 172
  Setting up news feeds and Google Sheets through IFTTT 172
  Setting up your daily personal newsletter 182
  Summary 188

Chapter 6: Predict whether Your Content Will Go Viral 189
  What does research tell us about virality? 190
  Sourcing shared counts and content 192
  Exploring the features of shareability 195
  Exploring image data 196
  Clustering 199
  Exploring the headlines 202
  Exploring the story content 207
  Building a predictive content scoring model 210
  Evaluating the model 213
  Adding new features to our model 216
  Summary 218

Chapter 7: Use Machine Learning to Forecast the Stock Market 219
  Types of market analysis 221
  What does research tell us about the stock market? 221
  So, what exactly is a momentum strategy? 222
  How to develop a trading strategy 223
  Analysis of the data 225
  Volatility of the returns 227
  Daily returns 229
  Statistics for the strategies 230
  The mystery strategy 232
  Building the regression model 237
  Performance of the model 239
  Dynamic time warping 246
  Evaluating our trades 249
  Summary 250

Chapter 8: Classifying Images with Convolutional Neural Networks 251
  Image-feature extraction 252
  Convolutional neural networks 255
  Network topology 256
  Convolutional layers and filters 258
  Max pooling layers 268
  Flattening 270
  Fully-connected layers and output 271
  Building a convolutional neural network to classify images in the Zalando Research dataset, using Keras 272
  Summary 290

Chapter 9: Building a Chatbot 291
  The Turing Test 292
  The history of chatbots 293
  The design of chatbots 298
  Building a chatbot 303
  Sequence-to-sequence modeling for chatbots 309
  Summary 320

Chapter 10: Build a Recommendation Engine 321
  Collaborative filtering 322
  So, what's collaborative filtering? 322
  Predicting the rating for the product 325
  Content-based filtering 329
  Hybrid systems 330
  Collaborative filtering 330
  Content-based filtering 330
  Building a recommendation engine 331
  Summary 348

Chapter 11: What's Next? 349
  Summary of the projects 349
  Summary 353

Other Books You May Enjoy 354
Index 357