
Processing-in-Memory for AI: From Circuits to Systems

168 Pages · 2022 · 7.484 MB · English


Joo-Young Kim, Bongjin Kim, Tony Tae-Hyoung Kim (Editors)
Processing-in-Memory for AI: From Circuits to Systems

Editors:
Joo-Young Kim, Korea Advanced Institute of Science and Technology, Daejeon, Korea (Republic of)
Bongjin Kim, University of California, Santa Barbara, CA, USA
Tony Tae-Hyoung Kim, Nanyang Technological University, Nanyang, Singapore

ISBN 978-3-030-98780-0
ISBN 978-3-030-98781-7 (eBook)
https://doi.org/10.1007/978-3-030-98781-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023, corrected publication 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Contents

1 Introduction (Joo-Young Kim)
2 Backgrounds (Chengshuo Yu, Hyunjoon Kim, Bongjin Kim, and Tony Tae-Hyoung Kim)
3 SRAM-Based Processing-in-Memory (PIM) (Hyunjoon Kim, Chengshuo Yu, and Bongjin Kim)
4 DRAM-Based Processing-in-Memory (Donghyuk Kim and Joo-Young Kim)
5 ReRAM-Based Processing-in-Memory (PIM) (Tony Tae-Hyoung Kim, Lu Lu, and Yuzong Chen)
6 PIM for ML Training (Jaehoon Heo and Joo-Young Kim)
7 PIM Software Stack (Donghyuk Kim and Joo-Young Kim)
8 Conclusion (Joo-Young Kim, Bongjin Kim, and Tony Tae-Hyoung Kim)
Correction to: Processing-in-Memory for AI

Chapter 1
Introduction
Joo-Young Kim

1.1 Hardware Acceleration for Artificial Intelligence and Machine Learning

Artificial intelligence (AI) and machine learning (ML) technologies enable computers to mimic cognitive tasks long believed to be possible only for humans, such as recognition, understanding, and reasoning [1]. A deep ML model named AlexNet [2], which uses eight layers in total, won the famous large-scale image recognition competition ImageNet by a significant margin over shallow ML models in 2012. Since then, the deep learning (DL) revolution has been ignited and has spread to many other domains such as speech recognition [3], natural language processing [4], virtual assistance [5], autonomous vehicles [6], and robotics [7]. With significant successes in various domains, DL is revolutionizing a wide range of industry sectors such as information technology, mobile communication, automotive, and manufacturing [8]. However, as more industries adopt the new technology and more people use it daily, we face an ever-increasing demand for a new type of hardware for these workloads. Conventional hardware platforms such as CPUs and GPUs are not suitable for the new workloads.
CPUs cannot cope with the tremendous amount of data transfer and computation required by ML workloads, while GPUs consume large amounts of power and carry high operating costs.

An AI chip, or accelerator, is hardware that enables faster and more energy-efficient processing of AI workloads (Fig. 1.1). Over the past few years, many AI accelerators have been developed to serve the new workloads, targeting platforms ranging from battery-powered edge devices [9–11] to datacenter servers [12]. As McKinsey predicted in its report [13], the AI semiconductor industry is expected to grow 18–19% every year to $65 billion, accounting for about 19% of the entire semiconductor market in 2025.

Fig. 1.1 AI chip and its market prediction

So far, the AI hardware industry has been led by big tech companies. Google developed its own AI chip, named the tensor processing unit (TPU), that can work with the TensorFlow [14] software framework. Amazon developed the Inferentia chip [15] for high-performance ML inference. Microsoft's BrainWave [16] uses FPGA infrastructure to accelerate ML workloads at scale. Even the electric car maker Tesla developed the full self-driving (FSD) chip for autonomous vehicles. There are also many start-up companies in this domain. Habana Labs, acquired by Intel in late 2019, developed the Gaudi processor for AI training. Graphcore has developed the intelligence processing unit (IPU) [17] and deployed it in datacenters. Groq's tensor streaming processor [18] optimizes data streaming and computation with fixed task scheduling. Cerebras's wafer-scale engine [19] tries to use a whole wafer as an ML processor to keep a large model on chip without external memories.

J.-Y. Kim, School of Electrical Engineering (E3-2), KAIST, Daejeon, South Korea
1.2 Machine Learning Computations

In this section, we introduce the basic models of deep neural networks (DNNs) and their computations. A DNN model is composed of multiple layers of artificial neurons, where the neurons of each layer are interconnected with the neurons in the neighboring layers. The mathematical model of the neuron comes from Frank Rosenblatt's perceptron [20], as shown in Fig. 1.2. Inspired by the biological neuron, it receives multiple inputs from many input neurons and accumulates their weighted sum with a bias. It then decides the output through an activation function, where the activation function is non-linear and differentiable, with a step-like characteristic shape. As a result, the output of a neuron is expressed with the following equation:

y = f(w_1 x_1 + w_2 x_2 + ... + w_n x_n + b)    (1.1)

Fig. 1.2 Artificial neuron: perceptron model

Based on the network connections, there are three major layer types in DNN models, which determine the actual computations: the fully connected, convolutional, and recurrent layers.

1.2.1 Fully Connected Layer

Fig. 1.3 Fully connected layer

Figure 1.3 shows the fully connected layer, which interconnects the neurons in the input layer to the neurons in the next layer. The input vector holds the values of the input neurons, a 3×1 vector in this case, and the output vector holds the values of the output neurons, a 4×1 vector. Each connection of the fully connected network between the two layers represents a weight parameter in the model. For example, W_01 represents the weight parameter of the connection between input neuron 0 and output neuron 1. Collectively, the network becomes a 4×3 weight matrix. In addition, each output neuron has a bias, so the layer has a 4×1 bias vector. For each output neuron, we can write the output value y using Eq. 1.1. As a result, the equations can be formulated into a matrix-vector equation as follows:

y = f(Wx + b)    (1.2)

If the model includes multiple layers, which is the case for deep neural networks, the matrix operations are cascaded one after another.
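As a minimal illustration of Eqs. 1.1 and 1.2, the forward pass of a fully connected layer can be sketched in NumPy. The function names and the choice of a logistic sigmoid for the activation f are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def sigmoid(z):
    # A non-linear, differentiable activation with a step-like shape,
    # as described for the perceptron model.
    return 1.0 / (1.0 + np.exp(-z))

def fully_connected(W, x, b, f=sigmoid):
    # Eq. 1.2: y = f(Wx + b)
    return f(W @ x + b)

# 3 input neurons -> 4 output neurons, as in Fig. 1.3.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # 4x3 weight matrix; W[o, i] links input i to output o
x = rng.standard_normal(3)       # 3x1 input vector
b = rng.standard_normal(4)       # 4x1 bias vector

y = fully_connected(W, x, b)
print(y.shape)  # (4,)
```

Cascading multiple such layers, with the output vector of one layer used as the input vector of the next, gives exactly the repeated matrix-vector structure described above.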
Traditional multi-layer perceptron (MLP) models as well as the latest transformer models [21] are based on the fully connected layer.

1.2.2 Convolutional Layer

Fig. 1.4 Convolutional layer

The convolutional layer iteratively performs 3-D convolution operations on the input layer using multiple weight kernels to generate the output layer, as illustrated in Fig. 1.4. The input layer has multiple 2-D input feature maps, sized H × H × C, and the size of each kernel is K × K × C. For computation, it performs 3-D convolution operations from top-left to bottom-right for each kernel with a stride of U. A single convolution operation accumulates all the inner products between the input window and the kernel. As a result of scanning with one kernel, it gets a single output feature map sized E × E. By repeating this process for all kernels, the convolutional layer produces the final output layer, sized E × E × M. The equation for an output point in the convolutional layer is as follows:

O[u][x][y] = B[u] + Σ_{k=0}^{C−1} Σ_{i=0}^{K−1} Σ_{j=0}^{K−1} I[k][Ux+i][Uy+j] · W[u][k][i][j],
0 ≤ u < M, 0 ≤ x, y < E, E = (H − K + U) / U    (1.3)

Many convolutional neural network (CNN) models use a number of convolutional layers with a few fully connected layers at the end for image classification and recognition tasks [22].

1.2.3 Recurrent Layer

Fig. 1.5 Recurrent layer

Figure 1.5 shows the recurrent layer, which has a feedback loop from the output back to the input in the fully connected setting. In this layer, the cell state of the previous timestamp affects the current state. Its computation is also matrix-vector multiplication, but it involves multiple steps with dependencies. Its cell and output values are expressed as follows; the hyperbolic tangent is usually used as the activation function in the recurrent layer:

h_t = f(U_h x_t + V_h h_{t−1} + b_h),  o_t = f(W_h h_t + b_o)    (1.4)

Recurrent neural networks (RNNs) such as GRUs [23] and LSTMs [24] are based on this type of layer and are popularly used for speech recognition.
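The triple sum of Eq. 1.3 can be sketched directly with loops in NumPy. This is a naive reference implementation written for clarity rather than speed, and the helper name `conv_layer` and the chosen tensor sizes are illustrative assumptions:

```python
import numpy as np

def conv_layer(I, W, B, U=1):
    # I: input feature maps, shape (C, H, H)
    # W: weight kernels,     shape (M, C, K, K)
    # B: biases,             shape (M,)
    # Returns the output layer O, shape (M, E, E), per Eq. 1.3.
    C, H, _ = I.shape
    M, _, K, _ = W.shape
    E = (H - K + U) // U
    O = np.empty((M, E, E))
    for u in range(M):          # one output feature map per kernel
        for x in range(E):
            for y in range(E):
                # A single convolution: accumulate all inner products
                # between the K x K x C input window and kernel u.
                window = I[:, U*x:U*x + K, U*y:U*y + K]
                O[u, x, y] = B[u] + np.sum(window * W[u])
    return O

rng = np.random.default_rng(1)
I = rng.standard_normal((3, 8, 8))     # C = 3 input maps, H = 8
W = rng.standard_normal((4, 3, 3, 3))  # M = 4 kernels, K = 3
B = rng.standard_normal(4)

O = conv_layer(I, W, B, U=1)
print(O.shape)  # (4, 6, 6), since E = (8 - 3 + 1) / 1 = 6
```

Production frameworks replace these loops with highly optimized matrix routines, but the data reuse visible here (every input window is touched by every kernel) is precisely what makes the layer attractive for processing-in-memory.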
