
Optimization Algorithms for Distributed Machine Learning PDF

137 Pages · 2023 · 4.476 MB · English

Preview Optimization Algorithms for Distributed Machine Learning

Gauri Joshi
Optimization Algorithms for Distributed Machine Learning

Synthesis Lectures on Learning, Networks, and Algorithms
Series Editor: Lei Ying, ECE, University of Michigan–Ann Arbor, Ann Arbor, USA

The series publishes short books on the design, analysis, and management of complex networked systems using tools from control, communications, learning, optimization, and stochastic analysis. Each Lecture is a self-contained presentation of one topic by a leading expert. The topics include learning, networks, and algorithms, and cover a broad spectrum of applications to networked systems including communication networks, data-center networks, social, and transportation networks.

Gauri Joshi
Carnegie Mellon University, Pittsburgh, PA, USA

ISSN 2690-4306   ISSN 2690-4314 (electronic)
Synthesis Lectures on Learning, Networks, and Algorithms
ISBN 978-3-031-19066-7   ISBN 978-3-031-19067-4 (eBook)
https://doi.org/10.1007/978-3-031-19067-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

To my students and collaborators, for their dedicated work and insightful discussions without which this book would not have been possible.

To my family, for their unconditional support and encouragement.

Preface

Stochastic gradient descent is the backbone of supervised machine learning training today. Classical SGD was designed to be run on a single computing node, and its error convergence with respect to the number of iterations has been extensively analyzed and improved in the optimization and learning theory literature. However, due to the massive training datasets and models used today, running SGD at a single node can be prohibitively slow. This calls for distributed implementations of SGD, where gradient computation and aggregation are split across multiple worker nodes. Although parallelism boosts the amount of data processed per iteration, it exposes SGD to unpredictable node slowdown and communication delays stemming from variability in the computing infrastructure. Thus, there is a critical need to make distributed SGD fast, yet robust to system variability.
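As a concrete illustration of the distributed setting described in the paragraph above, the following minimal Python sketch (not code from the book; the toy least-squares problem, worker count, and helper names such as `local_grad` are assumptions chosen for illustration) runs synchronous, parameter-server-style SGD: each worker computes a mini-batch gradient on its own data shard, and the server averages the gradients to update the model.

```python
# Minimal sketch (not from the book): synchronous, parameter-server-style
# distributed SGD on a toy least-squares problem. Each "worker" holds a shard
# of the data, computes a mini-batch gradient, and the server averages them.
import numpy as np

rng = np.random.default_rng(0)
d, n, num_workers = 10, 4000, 4          # dimension, samples, workers (assumed values)
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

shards = np.array_split(np.arange(n), num_workers)   # partition the data across workers

def local_grad(w, shard, batch_size=32):
    """Mini-batch gradient of 0.5 * ||X_b w - y_b||^2 / batch_size on one worker's shard."""
    idx = rng.choice(shard, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(d)
lr = 0.05
for it in range(200):
    grads = [local_grad(w, shard) for shard in shards]   # done in parallel in a real system
    w -= lr * np.mean(grads, axis=0)                     # server aggregates and updates

print("distance to true model:", np.linalg.norm(w - w_true))
```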
In this book, we will discuss state-of-the-art algorithms in large-scale machine learning that improve the scalability of distributed SGD via techniques such as asynchronous aggregation, local updates, quantization, and decentralized consensus. These methods reduce the communication cost in several different ways: asynchronous aggregation allows overlap between communication and local computation; local updates reduce the communication frequency, thus amortizing the communication delay across several iterations; quantization and sparsification methods reduce the per-iteration communication time; and decentralized consensus offers spatial communication reduction by allowing different nodes in a network topology to train models and average them with neighbors in parallel.

For each of the distributed SGD algorithms presented here, the book also provides an analysis of its convergence. However, unlike traditional optimization literature, we do not only focus on the error-versus-iterations convergence, or the iteration complexity. In distributed implementations, it is important to study the error-versus-wallclock-time convergence, because the wallclock time taken to complete each iteration is affected by the synchronization and communication protocol. We model computation and communication delays as random variables and determine the expected wallclock runtime per iteration of the various distributed SGD algorithms presented in this book. By pairing this runtime analysis with the error convergence analysis, one can get a true comparison of the convergence speed of different algorithms. The book advocates a system-aware philosophy, which is cognizant of computation, synchronization, and communication delays, toward the design and analysis of distributed machine learning algorithms.

This book would not have been possible without the wonderful research done by my students and collaborators. I thank them for helping me learn the material presented in this book. Our research was generously supported by several funding agencies, including the National Science Foundation, and research awards from IBM, Google, and Meta. I was also inspired by the enthusiasm of the students who took my class on large-scale machine learning infrastructure over the past few years. Finally, I am immensely grateful to my family and friends for their constant support and encouragement.

Pittsburgh, PA, USA
August 2022

Gauri Joshi
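Before turning to the contents, the system-aware runtime view described in the preface can be illustrated with a small simulation. The sketch below is not taken from the book: it simply models each worker's per-iteration compute-plus-communication delay as an i.i.d. exponential random variable (an assumption made here only for illustration) and estimates the expected wallclock time per iteration when a parameter server waits for all P workers versus only the fastest K of them, in the spirit of the straggler-resilient K-synchronous variants listed in the contents.

```python
# Illustrative Monte Carlo sketch (assumed exponential delays, not the book's
# derivation): expected wallclock time per iteration when the parameter server
# waits for all P workers vs. only the fastest K of them.
import numpy as np

rng = np.random.default_rng(1)
P = 8                      # number of workers (assumed)
K = 4                      # number of fastest workers to wait for (assumed)
mean_delay = 1.0           # mean per-worker compute + communication time (assumed units)
trials = 100_000

delays = rng.exponential(mean_delay, size=(trials, P))
t_sync = delays.max(axis=1)                       # fully synchronous: wait for the slowest worker
t_ksync = np.sort(delays, axis=1)[:, K - 1]       # K-synchronous: wait for the K-th fastest worker

print(f"E[time/iter], wait for all {P} workers      : {t_sync.mean():.3f}")
print(f"E[time/iter], wait for fastest {K} of {P} : {t_ksync.mean():.3f}")
# For i.i.d. Exp(1) delays, E[max of 8] equals the 8th harmonic number, about 2.72,
# while the expected 4th-fastest finish time is about 0.63, a roughly 4x reduction.
```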
Contents

1 Distributed Optimization in Machine Learning
  1.1 SGD in Supervised Machine Learning
    1.1.1 Training Data and Hypothesis
    1.1.2 Empirical Risk Minimization
    1.1.3 Gradient Descent
    1.1.4 Stochastic Gradient Descent
    1.1.5 Mini-batch SGD
    1.1.6 Linear Regression
    1.1.7 Logistic Regression
    1.1.8 Neural Networks
  1.2 Distributed Stochastic Gradient Descent
    1.2.1 The Parameter Server Framework
    1.2.2 The System-Aware Design Philosophy
  1.3 Scalable Distributed SGD Algorithms
    1.3.1 Straggler-Resilient and Asynchronous SGD
    1.3.2 Communication-Efficient Distributed SGD
    1.3.3 Decentralized SGD
  References

2 Calculus, Probability and Order Statistics Review
  2.1 Calculus and Linear Algebra
    2.1.1 Norms and Inner Products
    2.1.2 Lipschitz Continuity and Smoothness
    2.1.3 Strong Convexity
  2.2 Probability Review
    2.2.1 Random Variable
    2.2.2 Expectation and Variance
    2.2.3 Some Canonical Random Variables
    2.2.4 Bayes Rule and Conditional Probability
  2.3 Order Statistics
    2.3.1 Order Statistics of the Exponential Distribution
    2.3.2 Order Statistics of the Uniform Distribution
    2.3.3 Asymptotic Distribution of Quantiles

3 Convergence of SGD and Variance-Reduced Variants
  3.1 Gradient Descent (GD) Convergence
    3.1.1 Effect of Learning Rate and Other Parameters
    3.1.2 Iteration Complexity
  3.2 Convergence Analysis of Mini-batch SGD
    3.2.1 Effect of Learning Rate and Mini-batch Size
    3.2.2 Iteration Complexity
    3.2.3 Non-convex Objectives
  3.3 Variance-Reduced SGD Variants
    3.3.1 Dynamic Mini-batch Size Schedule
    3.3.2 Stochastic Average Gradient (SAG)
    3.3.3 Stochastic Variance Reduced Gradient (SVRG)
  References

4 Synchronous SGD and Straggler-Resilient Variants
  4.1 Parameter Server Framework
  4.2 Distributed Synchronous SGD Algorithm
  4.3 Convergence Analysis
    4.3.1 Iteration Complexity
  4.4 Runtime per Iteration
    4.4.1 Gradient Computation and Communication Time
    4.4.2 Expected Runtime per Iteration
    4.4.3 Error Versus Runtime Convergence
  4.5 Straggler-Resilient Variants
    4.5.1 K-Synchronous SGD
    4.5.2 K-Batch-Synchronous SGD
  References

5 Asynchronous SGD and Staleness-Reduced Variants
  5.1 The Asynchronous SGD Algorithm
    5.1.1 Comparison with Synchronous SGD
  5.2 Runtime Analysis
    5.2.1 Runtime Speed-Up Compared to Synchronous SGD
  5.3 Convergence Analysis
    5.3.1 Implications of the Asynchronous SGD Convergence Bound
  5.4 Staleness-Reduced Variants of Asynchronous SGD
    5.4.1 K-Asynchronous SGD
