Machine Learning on Commodity Tiny Devices

This book aims at the tiny machine learning (TinyML) software and hardware synergy for edge intelligence applications. It presents on-device learning techniques covering model-level neural network design, algorithm-level training optimization and hardware-level instruction acceleration. Analyzing the limitations of conventional in-cloud computing reveals that on-device learning is a promising research direction for meeting the requirements of edge intelligence applications. In cutting-edge TinyML research, implementing a high-efficiency learning framework and enabling system-level acceleration are among the most fundamental issues. This book presents a comprehensive discussion of the latest research progress and provides system-level insights on designing TinyML frameworks, including neural network design, training algorithm optimization and domain-specific hardware acceleration. It identifies the main challenges of deploying TinyML tasks in the real world and guides researchers in building reliable learning systems.

This book will be of interest to students and scholars in the field of edge intelligence, especially those with professional Edge AI skills. It will also be an excellent guide for researchers implementing high-performance TinyML systems.

Song Guo is a Full Professor leading the Edge Intelligence Lab and the Research Group of Networking and Mobile Computing at the Hong Kong Polytechnic University. Professor Guo is a Fellow of the Canadian Academy of Engineering, a Fellow of the IEEE, a Fellow of the AAIA and a Clarivate Highly Cited Researcher.

Qihua Zhou is a PhD student with the Department of Computing at the Hong Kong Polytechnic University. His research interests include distributed AI systems, large-scale parallel processing, TinyML systems and domain-specific accelerators.

Machine Learning on Commodity Tiny Devices
Theory and Practice

Song Guo and Qihua Zhou

First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Song Guo and Qihua Zhou

Reasonable efforts have been made to publish reliable data and information, but the authors and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400.
For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 978-1-032-37423-9 (hbk)
ISBN: 978-1-032-37426-0 (pbk)
ISBN: 978-1-003-34022-5 (ebk)

DOI: 10.1201/9781003340225

Typeset in CMR10 font by KnowledgeWorks Global Ltd.

Publisher's note: This book has been prepared from camera-ready copy provided by the authors.

Contents

List of Figures
List of Tables

Chapter 1 ■ Introduction
  1.1 WHAT IS MACHINE LEARNING ON DEVICES?
  1.2 ON-DEVICE LEARNING AND TINYML SYSTEMS
    1.2.1 Property of On-Device Learning
    1.2.2 Objectives of TinyML Systems
  1.3 CHALLENGES FOR REALISTIC IMPLEMENTATION
  1.4 PROBLEM STATEMENT OF BUILDING TINYML SYSTEMS
  1.5 DEPLOYMENT PROSPECTS AND DOWNSTREAM APPLICATIONS
    1.5.1 Evaluation Metrics for Practical Methods
    1.5.2 Intelligent Medical Diagnosis
    1.5.3 AI-Enhanced Motion Tracking
    1.5.4 Domain-Specific Acceleration Chips
  1.6 THE SCOPE AND ORGANIZATION OF THIS BOOK

Chapter 2 ■ Fundamentals: On-Device Learning Paradigm
  2.1 MOTIVATION
    2.1.1 Drawbacks of In-Cloud Learning
    2.1.2 Rise of On-Device Learning
    2.1.3 Bit Precision and Data Quantization
    2.1.4 Potential Gains
    2.1.5 Why Not Existing Quantization Methods?
  2.2 BASIC TRAINING ALGORITHMS
    2.2.1 Stochastic Gradient Descent
    2.2.2 Mini-Batch Stochastic Gradient Descent
    2.2.3 Training of Neural Networks
  2.3 PARAMETER SYNCHRONIZATION FOR DISTRIBUTED TRAINING
    2.3.1 Parameter Server Paradigm
    2.3.2 Parameter Synchronization Pace
    2.3.3 Heterogeneity-Aware Distributed Training
  2.4 MULTI-CLIENT ON-DEVICE LEARNING
    2.4.1 Preliminary Experiments
    2.4.2 Observations
      2.4.2.1 Training Convergence Efficiency
      2.4.2.2 Synchronization Frequency
      2.4.2.3 Communication Traffic
    2.4.3 Summary
  2.5 DEVELOPING KITS AND EVALUATION PLATFORMS
    2.5.1 Devices
    2.5.2 Benchmarks
    2.5.3 Pipeline
  2.6 CHAPTER SUMMARY

Chapter 3 ■ Preliminary: Theories and Algorithms
  3.1 ELEMENTS OF NEURAL NETWORKS
    3.1.1 Fully Connected Network
    3.1.2 Convolutional Neural Network
    3.1.3 Attention-Based Neural Network
  3.2 MODEL-ORIENTED OPTIMIZATION ALGORITHMS
    3.2.1 Tiny Transformer
    3.2.2 Quantization Strategy for Transformer
  3.3 PRACTICE ON SIMPLE CONVOLUTIONAL NEURAL NETWORKS
    3.3.1 PyTorch Installation
      3.3.1.1 On macOS
      3.3.1.2 On Windows
    3.3.2 CIFAR-10 Dataset
    3.3.3 Construction of CNN Model
      3.3.3.1 Convolutional Layers
      3.3.3.2 Activation Layers
      3.3.3.3 Pooling Layers
      3.3.3.4 Fully Connected Layers
      3.3.3.5 Structure of LeNet-5
    3.3.4 Model Training
    3.3.5 Model Testing
    3.3.6 GPU Acceleration
      3.3.6.1 CUDA Installation
      3.3.6.2 Programming for GPU
    3.3.7 Load Pre-Trained CNNs

Chapter 4 ■ Model-Level Design: Computation Acceleration and Communication Saving
  4.1 OPTIMIZATION OF NETWORK ARCHITECTURE
    4.1.1 Network-Aware Parameter Pruning
      4.1.1.1 Pruning Steps
      4.1.1.2 Pruning Strategy
      4.1.1.3 Pruning Metrics
      4.1.1.4 Summary
    4.1.2 Knowledge Distillation
      4.1.2.1 Combination of Loss Functions
      4.1.2.2 Tuning of Hyper-Parameters
      4.1.2.3 Usage of Model Training
      4.1.2.4 Summary
    4.1.3 Model Fine-Tuning
      4.1.3.1 Transfer Learning
      4.1.3.2 Layer-Wise Freezing and Updating
      4.1.3.3 Model-Wise Feature Sharing
      4.1.3.4 Summary
    4.1.4 Neural Architecture Search
      4.1.4.1 Search Space of HW-NAS
      4.1.4.2 Targeted Hardware Platforms
      4.1.4.3 Trend of Current HW-NAS Methods
  4.2 OPTIMIZATION OF TRAINING ALGORITHM
    4.2.1 Low Rank Factorization
    4.2.2 Data-Adaptive Regularization
      4.2.2.1 Core Formulation
      4.2.2.2 On-Device Network Sparsification
      4.2.2.3 Block-Wise Regularization
      4.2.2.4 Summary
    4.2.3 Data Representation and Numerical Quantization
      4.2.3.1 Elements of Quantization
      4.2.3.2 Post-Training Quantization
      4.2.3.3 Quantization-Aware Training
      4.2.3.4 Summary
  4.3 CHAPTER SUMMARY

Chapter 5 ■ Hardware-Level Design: Neural Engines and Tensor Accelerators
  5.1 ON-CHIP RESOURCE SCHEDULING
    5.1.1 Embedded Memory Controlling
    5.1.2 Underlying Computational Primitives
    5.1.3 Low-Level Arithmetical Instructions
    5.1.4 MIMO-Based Communication
  5.2 DOMAIN-SPECIFIC HARDWARE ACCELERATION
    5.2.1 Multiple Processing Primitives Scheduling
    5.2.2 I/O Connection Optimization
    5.2.3 Cache Management
    5.2.4 Topology Construction
  5.3 CROSS-DEVICE ENERGY EFFICIENCY
    5.3.1 Multi-Client Collaboration
    5.3.2 Efficiency Analysis
    5.3.3 Problem Formulation for Energy Saving
    5.3.4 Algorithm Design and Pipeline Overview
  5.4 DISTRIBUTED ON-DEVICE LEARNING
    5.4.1 Community-Aware Synchronous Parallel
    5.4.2 Infrastructure Design
    5.4.3 Community Manager
    5.4.4 Weight Learner
      5.4.4.1 Distance Metric Learning
      5.4.4.2 Asynchronous Advantage Actor-Critic
      5.4.4.3 Agent Learning Methodology
    5.4.5 Distributed Training Controller
      5.4.5.1 Intra-Community Synchronization
      5.4.5.2 Inter-Community Synchronization
      5.4.5.3 Communication Traffic Aggregation
  5.5 CHAPTER SUMMARY

Chapter 6 ■ Infrastructure-Level Design: Serverless and Decentralized Machine Learning
  6.1 SERVERLESS COMPUTING
    6.1.1 Definition of Serverless Computing
    6.1.2 Architecture of Serverless Computing
      6.1.2.1 Virtualization Layer
      6.1.2.2 Encapsulation Layer
      6.1.2.3 System Orchestration Layer
      6.1.2.4 System Coordination Layer
    6.1.3 Benefits of Serverless Computing
    6.1.4 Challenges of Serverless Computing
      6.1.4.1 Programming and Modeling
      6.1.4.2 Pricing and Cost Prediction
      6.1.4.3 Scheduling
      6.1.4.4 Intra-Communications of Functions
      6.1.4.5 Data Caching
      6.1.4.6 Security and Privacy
  6.2 SERVERLESS MACHINE LEARNING
    6.2.1 Introduction
    6.2.2 Machine Learning and Data Management
    6.2.3 Training Large Models in Serverless Computing
      6.2.3.1 Data Transfer and Parallelism in Serverless Computing
      6.2.3.2 Data Parallelism for Model Training in Serverless Computing
      6.2.3.3 Optimizing Parallelism Structure in Serverless Training
    6.2.4 Cost-Efficiency in Serverless Computing
  6.3 CHAPTER SUMMARY

Chapter 7 ■ System-Level Design: From Standalone to Clusters
  7.1 STALENESS-AWARE PIPELINING
    7.1.1 Data Parallelism