Production-Ready Applied Deep Learning

Learn how to construct and deploy complex models in PyTorch and TensorFlow deep learning frameworks

Tomasz Palczewski
Jaejun (Brandon) Lee
Lenin Mookiah

BIRMINGHAM—MUMBAI

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi
Senior Editor: Nazia Shaikh
Content Development Editor: Shreya Moharir
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Aparna Bhagat
Marketing Coordinators: Shifa Ansari and Abeer Dawe

First published: September 2022
Production reference: 1260822

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-80324-366-5

www.packt.com

To Sylwia, Anna, and Matt – my loves, my life. To my Mom, my brother Piotr, and my family.
- Tomasz

To my parents, Changhee and Kyung Ja, for always loving and supporting me.
- Jaejun

To my mom, Chendurkani, for her unconditional support and encouragement.
- Lenin

Finally, we would like to dedicate this book to the self-motivated and value-driven individuals who put their time into learning new technologies to make the world more exciting.

Contributors

About the authors

Tomasz Palczewski is currently working as a staff software engineer at Samsung Research America (SRA). He has a Ph.D. in physics and an eMBA degree from Quantic. His zeal for getting insights out of large datasets using cutting-edge techniques led him to work across the globe at CERN (Switzerland), LBNL (US), J-PARC (Japan), the University of Alabama (US), and the University of California, Berkeley (US). In 2016, he was deployed to the South Pole to calibrate the world's largest neutrino telescope. At some point, he decided to pivot his career and focus on applying his skills in industry. Currently, Dr. Palczewski works on modeling user behavior and creating value for advertising and marketing verticals by deploying machine learning (ML), deep learning, and statistical models at scale.

I had the idea of writing a book that my younger self would appreciate – one that would show the different aspects of production-ready deep learning. I am grateful that Jaejun and Lenin were excited about the idea and joined the project. Without their help, this book would not have turned out as it did. Finally, I would like to thank my wife for all her support.

Jaejun (Brandon) Lee is currently working as an AI research lead at RoboEye.ai, integrating cutting-edge algorithms in computer vision and AI into industrial automation solutions. He obtained his master's degree from the University of Waterloo, with research focused on natural language processing (NLP), specifically speech recognition. He has spent many years developing Howl, a fully productionized yet open source wake word detection toolkit with a web browser deployment target.
Luckily, his effort was picked up by Mozilla's Firefox Voice, and it is actively providing a completely hands-free experience to many users all over the world.

I would like to thank Tomasz for offering me this remarkable opportunity to become an author. Next, I am really grateful to Lenin for sharing his knowledge of data engineering throughout our journey. Lastly, I would like to thank Erica for her encouragement.

Lenin Mookiah is a machine learning engineer who has worked with reputed tech companies – Samsung Research America, eBay Inc., and Adobe R&D. He has worked in the technology industry for over 11 years across various domains: banking, retail, eDiscovery, and media. He has played various roles in the end-to-end productization of large-scale machine learning systems, mainly employing the big data ecosystem to build reliable feature pipelines that data scientists consume. Beyond his industry experience, he researched anomaly detection using a novel graph-based approach during his Ph.D. at Tennessee Tech University (US), and studied entity resolution on social networks during his master's at Tsinghua University, China.

Working with Tomasz and Jaejun was very exciting. I sincerely thank them both for the collaboration on this book; I have learned many aspects of data science from each of them.

About the reviewers

Utkarsh Srivastava is an AI/ML professional, trainer, YouTuber, and blogger. He loves to tackle and develop ML, NLP, and computer vision algorithms to solve complex problems. He started his data science career writing for his own blog (datamahadev.com) and YouTube channel (datamahadev), and then worked as a senior data science trainer at an institute in Gujarat. Additionally, he has trained and counseled 1,000+ working professionals and students in AI/ML. Utkarsh has successfully completed 40+ freelance training and development projects in data science and analytics, AI/ML, Python development, and SQL.
He hails from Lucknow and is currently settled in Bangalore, India, as an analyst at Deloitte USI Consulting.

I would like to thank my mother, Mrs. Rupam Srivastava, for her continuous guidance and support throughout my hardships and struggles. Thanks also to the Supreme Para-Brahman.

Neeraj Jhaveri is a cloud solution architect at Microsoft with expertise in providing data and AI solutions. He has around 20 years of IT experience. Over the last decade, working on data and analytics, he has provided AI architecture solutions on Azure. Using Azure ML and Cognitive Services, he has helped customers move to Azure using the latest technologies. He received a master's degree in computer science from NYIT. He gives frequent tech talks on fast-tracking the implementation of AI solutions in Azure.

Pooya Rezaei is an ML software engineer at Google, where he uses machine learning to estimate offline conversions from Google Ads. Previously, he was an ML engineer at Iterable for two years, optimizing their email marketing automation platform to maximize reach. He received a B.Sc. from the University of Tehran, an M.Sc. from the Sharif University of Technology, and a Ph.D. from the University of Vermont, all in electrical and computer engineering.

Table of Contents

Preface xiii

Part 1 – Building a Minimum Viable Product

1 Effective Planning of Deep Learning-Driven Projects 3
Technical requirements 3
What is DL? 3
  Understanding the role of DL in our daily lives 5
Overview of DL projects 7
  Project planning 7
  Building minimum viable products 7
  Building fully featured products 8
  Deployment and maintenance 8
  Project evaluation 8
Planning a DL project 9
  Defining goal and evaluation metrics 9
  Stakeholder identification 11
  Task organization 12
  Resource allocation 13
  Defining a timeline 14
  Managing a project 15
Summary 16
Further reading 17

2 Data Preparation for Deep Learning Projects 19
Technical requirements 20
Setting up notebook environments 20
  Setting up a Python environment 20
  Installing Anaconda 20
  Setting up a DL project using Anaconda 21
Data collection, data cleaning, and data preprocessing 23
  Collecting data 24
  Cleaning data 27
  Data preprocessing 31
Extracting features from data 35
  Converting text using bag-of-words 35
  Applying term frequency-inverse document frequency (TF-IDF) transformation 36
  Creating one-hot encoding (one-of-k) 37
  Creating ordinal encoding 38
  Converting a colored image into a grayscale image 39
  Performing dimensionality reduction 39
  Applying fuzzy matching to handle similarity between strings 41
Performing data visualization 42
  Performing basic visualizations using Matplotlib 43
  Drawing statistical graphs using Seaborn 45
Introduction to Docker 47
  Introduction to Dockerfiles 47
  Building a custom Docker image 48
Summary 48

3 Developing a Powerful Deep Learning Model 49
Technical requirements 49
Going through the basic theory of DL 50
  How does DL work? 50
  DL model training 51
Components of DL frameworks 53
  The data loading logic 53
  The model definition 53
  Model training logic 53
Implementing and training a model in PyTorch 55
  PyTorch data loading logic 55
  PyTorch model definition 57
  PyTorch model training 64
Implementing and training a model in TF 69
  TF data loading logic 69
  TF model definition 71
  TF model training 77
Decomposing a complex, state-of-the-art model implementation 83
  StyleGAN 84
  Implementation in PyTorch 86
  Implementation in TF 91
Summary 94

4 Experiment Tracking, Model Management, and Dataset Versioning 95
Technical requirements 95
Overview of DL project tracking 96
  Components of DL project tracking 96
  Tools for DL project tracking 97
DL project tracking with Weights & Biases 99
  Setting up W&B 100
DL project tracking with MLflow and DVC 104
  Setting up MLflow 104
  Setting up MLflow with DVC 106
Dataset versioning – beyond Weights & Biases, MLflow, and DVC 109
Summary 109

Part 2 – Building a Fully Featured Product

5 Data Preparation in the Cloud 113
Technical requirements 113
Data processing in the cloud 114
  Introduction to ETL 114
  Data processing system architecture 114
Introduction to Apache Spark 119
  Resilient distributed datasets and DataFrames 120
  Loading data 121
  Processing data using Spark operations 122
  Processing data using user-defined functions 126
  Exporting data 127
Setting up a single-node EC2 instance for ETL 128
Setting up an EMR cluster for ETL 130
Creating a Glue job for ETL 132
  Creating a Glue Data Catalog 133
  Setting up a Glue context 134
  Reading data 135
  Defining the data processing logic 136
  Writing data 136
Utilizing SageMaker for ETL 137
  Creating a SageMaker notebook 140
  Running a Spark job through a SageMaker notebook 141
  Running a job from a custom container through a SageMaker notebook 142
Comparing the ETL solutions in AWS 144
Summary 145

6 Efficient Model Training 147
Technical requirements 147
Training a model on a single machine 147
  Utilizing multiple devices for training in TensorFlow 148
  Utilizing multiple devices for training in PyTorch 150
Training a model on a cluster 151
  Model parallelism 152
  Data parallelism 155
Training a model using SageMaker 158
  Setting up model training for SageMaker 159
  Training a TensorFlow model using SageMaker 160
  Training a PyTorch model using SageMaker 161
  Training a model in a distributed fashion using SageMaker 162
  SageMaker with Horovod 163
Training a model using Horovod 164
  Setting up a Horovod cluster 165
  Configuring a TensorFlow training script for Horovod 166
  Configuring a PyTorch training script for Horovod 169
  Training a DL model on a Horovod cluster 170