Mastering Machine Learning on AWS
Advanced machine learning in Python using SageMaker, Apache Spark, and TensorFlow

Dr. Saket S.R. Mengle
Maximo Gurmendez

BIRMINGHAM - MUMBAI

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Devika Battike
Content Development Editor: Nathanya Dias
Technical Editor: Utkarsha S. Kadam
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Jisha Chirayil
Production Coordinator: Shraddha Falebhai

First published: May 2019
Production reference: 1150519

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78934-979-5
www.packtpub.com

I would like to dedicate this book in memory of my dad. Thanks for being there for me and supporting my dreams.
– Dr. Saket S.R. Mengle

This book is dedicated to Mateo and Paulina, who are my constant source of inspiration, joy and purpose.
– Maximo Gurmendez

Contributors

About the authors

Dr. Saket S.R. Mengle holds a PhD in text mining from Illinois Institute of Technology, Chicago. He has worked in a variety of fields, including text classification, information retrieval, large-scale machine learning, and linear optimization. He currently works as senior principal data scientist at dataxu, where he is responsible for developing and maintaining the algorithms that drive dataxu's real-time advertising platform.

I would like to thank my wife, Sharvari, who gives me strength and inspires me to be the best version of myself every day. This book would not have been possible without her love and support. I would also like to thank my parents, Subhash and Rashmi Mengle, who taught me the value of hard work. I would like to express my appreciation to my advisor, Dr. Nazli Goharian, and Dr. Ophir Frieder, who introduced me to the world of Machine Learning.

Maximo Gurmendez holds a master's degree in computer science/AI from Northeastern University, where he attended as a Fulbright Scholar. Since 2009, he has been working with dataxu as data science engineering lead. He's also the founder of Montevideo Labs (a data science and engineering consultancy). Additionally, Maximo is a computer science professor at the University of Montevideo and is director of its data science for business program.

I'd like to deeply thank my wife Maggie for her sustained support, encouragement and patience, especially throughout the long work days and busy weekends that writing this book implied. Additionally, I'd like to thank my mother, Margarita, who taught me the importance of learning, caring and hard work through her own example. Finally, I'd like to express my gratitude to the dataxu team from whom I learned so much in the past ten years.

About the reviewer

Chirag Nayyar helps organizations initiate their digital transformation using the public cloud. He has been actively working on cloud platforms since 2013, providing consultancy to many organizations, ranging from small and mid-size businesses to enterprises. He holds a wide range of certifications from all major public cloud platforms. He also runs a meet-up group and is a regular speaker at various cloud events. He has also reviewed Hands-On Machine Learning on Google Cloud Platform and Google Cloud Platform Cookbook, by Packt Publishing.

Table of Contents

Preface

Section 1: Machine Learning on AWS

Chapter 1: Getting Started with Machine Learning for AWS
    How AWS empowers data scientists
    Using AWS tools for machine learning
    Identifying candidate problems that can be solved using machine learning
    Machine learning project life cycle
    Data gathering
    Evaluation metrics
    Algorithm selection
    Deploying models
    Summary
    Exercise

Section 2: Implementing Machine Learning Algorithms at Scale on AWS

Chapter 2: Classifying Twitter Feeds with Naive Bayes
    Classification algorithms
    Feature types
    Nominal features
    Ordinal features
    Continuous features
    Naive Bayes classifier
    Bayes' theorem
    Posterior
    Likelihood
    Prior probability
    Evidence
    How the Naive Bayes algorithm works
    Classifying text with language models
    Collecting the tweets
    Preparing the data
    Building a Naive Bayes model through SageMaker notebooks
    Naïve Bayes model on SageMaker notebooks using Apache Spark
    Using SageMaker's BlazingText built-in ML service
    Naive Bayes – pros and cons
    Summary
    Exercises

Chapter 3: Predicting House Value with Regression Algorithms
    Predicting the price of houses
    Understanding linear regression
    Linear least squares estimation
    Maximum likelihood estimation
    Gradient descent
    Evaluating regression models
    Mean absolute error
    Mean squared error
    Root mean squared error
    R-squared
    Implementing linear regression through scikit-learn
    Implementing linear regression through Apache Spark
    Implementing linear regression through SageMaker's linear Learner
    Understanding logistic regression
    Logistic regression in Spark
    Pros and cons of linear models
    Summary

Chapter 4: Predicting User Behavior with Tree-Based Methods
    Understanding decision trees
    Recursive splitting
    Types of decision trees
    Cost functions
    Gini Impurity
    Information gain
    Criteria to stop splitting trees
    Understanding random forest algorithms
    Understanding gradient boosting algorithms
    Predicting clicks on log streams
    Introduction to Elastic MapReduce (EMR)
    Training with Apache Spark on EMR
    Getting the data
    Preparing the data
    Categorical encoding
    One-hot encoding
    Training a model
    Evaluating our model
    Area Under ROC Curve
    Area under the precision-recall curve
    Training tree ensembles on EMR
    Training gradient-boosted trees with the SageMaker services
    Preparing the data
    Training with SageMaker XGBoost
    Applying and evaluating the model
    Summary
    Exercises

Chapter 5: Customer Segmentation Using Clustering Algorithms
    Understanding How Clustering Algorithms Work
    k-means clustering
    Euclidean distance
    Manhattan distance
    Hierarchical clustering
    Agglomerative clustering
    Divisive clustering
    Clustering with Apache Spark on EMR
    Clustering with Spark and SageMaker on EMR
    Understanding the purpose of the IAM role
    Summary
    Exercises

Chapter 6: Analyzing Visitor Patterns to Make Recommendations
    Making theme park attraction recommendations through Flickr data
    Collaborative filtering
    Memory-based approach
    Model-based approach
    Matrix factorization
    Stochastic gradient descent
    Alternating Least Squares
    Finding recommendations through Apache Spark's ALS
    Data gathering and exploration
    Training the model
    Getting recommendations
    Recommending attractions through SageMaker Factorization Machines
    Preparing the dataset for learning
    Training the model
    Getting recommendations
    Summary
    Exercises

Section 3: Deep Learning

Chapter 7: Implementing Deep Learning Algorithms
    Understanding deep learning
    Applications of deep learning
    Self-driving cars
    Learning to play video games using a deep learning algorithm
    Understanding deep learning algorithms