Mixed Low-precision Deep Learning Inference using Dynamic Fixed Point

Naveen Mellempudi, Abhisek Kundu, Dipankar Das, Dheevatsa Mudigere, and Bharat Kaul
Parallel Computing Lab, Intel Labs, Bangalore, India

arXiv:1701.08978v2 [cs.LG] 1 Feb 2017

Abstract

We propose a cluster-based quantization method to convert pre-trained full-precision weights into ternary weights with minimal impact on accuracy. In addition, we constrain the activations to 8 bits, enabling a sub-8-bit full integer inference pipeline. Our method uses small clusters of N filters with a common scaling factor to minimize the quantization loss while also maximizing the number of ternary operations. We show that, with a cluster size of N=4 on ResNet-101, we can achieve 71.8% TOP-1 accuracy, within 6% of the best full-precision result, while replacing ≈85% of all multiplications with 8-bit accumulations. Using the same method with 4-bit weights achieves 76.3% TOP-1 accuracy, which is within 2% of the full-precision result. We also study the impact of cluster size on both performance and accuracy: larger cluster sizes (N=64) can replace ≈98% of the multiplications with ternary operations but introduce a significant drop in accuracy, which necessitates fine-tuning the parameters by retraining the network at lower precision. To address this, we have also trained a low-precision ResNet-50 with 8-bit activations and ternary weights by pre-initializing the network with full-precision weights, and achieve 68.9% TOP-1 accuracy within 4 additional epochs. Our final quantized model can run on a full 8-bit compute pipeline, with a potential 16x improvement in performance compared to baseline full-precision models.

1 Introduction

Deep Learning has achieved unparalleled success with large-scale machine learning. Deep Learning models are used to achieve state-of-the-art results on a wide variety of tasks including Computer Vision, Natural Language Processing, Automatic Speech Recognition and Reinforcement Learning [1]. Mathematically, this involves solving a complex non-convex optimization problem with on the order of millions or more parameters. Solving this optimization problem, also referred to as training the neural network, is a compute-intensive process that for current state-of-the-art networks requires days to weeks. Once trained, the DNN is used by evaluating this many-parameter function on specific input data, usually referred to as inference. While the compute intensity of inference is much lower than that of training, inference still involves a significant amount of compute. Moreover, because inference is performed on a large number of inputs, the total computing resources spent on inference are likely to dwarf those spent on training. The large and somewhat unique compute requirements of both deep learning training and inference motivate the use of non-standard customized arithmetic [6, 2, 5, 14, 8, 7] and specialized compute hardware to run these computations as efficiently as possible [4, 15, 13, 11]. Furthermore, there is some theoretical evidence and there are numerous empirical observations that deep learning operations can be performed successfully at much lower precision.

In this work we focus on reducing the compute requirements for deep learning inference by directly quantizing pre-trained models with minimal (or no) retraining, while achieving near state-of-the-art accuracy. Our paper makes the following contributions:

1. We propose a novel cluster-based quantization method to convert pre-trained weights to a lower-precision representation with minimal loss in test accuracy.
2. On ResNet-101 with 8-bit activations and a cluster size of N=4 to quantize weights, we achieve 76.3% TOP-1 accuracy with 4-bit weights and 71.8% TOP-1 accuracy with 2-bit ternary weights. To the best of our knowledge this is the best reported accuracy with ternary weights on the ImageNet dataset [3] without retraining the network.
3. We explore the performance-accuracy trade-off using different cluster sizes with the ternary weight representation. For a cluster size of N, we reduce the higher-precision ops (8-bit multiplies) to one for every N*K^2 lower-precision ops (8-bit accumulations), which results in a significant reduction in computational complexity (see the short sketch after this list for the op-count arithmetic). Using a smaller cluster size of N=4 we achieve state-of-the-art accuracy, but larger cluster sizes (N=64) would require retraining the network at lower precision to achieve comparable accuracy.
4. We train a pre-initialized low-precision ResNet-50 using 8-bit activations and 2-bit weights with a larger cluster size (N=64) and achieve 68.9% TOP-1 accuracy on the ImageNet dataset [3] within 4 epochs of fine-tuning.
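As a rough back-of-the-envelope check of the op-count claim in contribution 3, the sketch below computes the fraction of multiplications replaced by accumulations for a given cluster size. It is a minimal illustration: the 50/50 mix of 3x3 and 1x1 convolutions is an assumption borrowed from the estimate in Section 3.3, the equal weighting of the two kernel sizes is a simplification, and the helper function is ours, not part of the paper.

```python
# Rough estimate of how many multiplications a cluster of N filters with
# KxK kernels replaces with ternary/8-bit accumulations: one 8-bit multiply
# (the shared scaling factor) remains per N*K*K products.

def replaced_fraction(cluster_size, kernel_mix):
    """kernel_mix: list of (kernel size K, fraction of convolution work)."""
    total, low_precision = 0.0, 0.0
    for k, frac in kernel_mix:
        ops = cluster_size * k * k               # products covered by one scaling factor
        low_precision += frac * (ops - 1) / ops  # all but one become accumulations
        total += frac
    return low_precision / total

# N=4 with an assumed 50/50 mix of 3x3 and 1x1 filters.
print(replaced_fraction(4, [(3, 0.5), (1, 0.5)]))   # ~0.86, close to the ~85% quoted later
# N=64 pushes the ratio higher, roughly matching the ~98% figure in Section 3.3.
print(replaced_fraction(64, [(3, 0.5), (1, 0.5)]))  # ~0.99
```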
2 Related Work

Deep learning training and inference are highly compute-intensive operations; however, using full-precision (FP32) computation on conventional hardware is inefficient and not strictly warranted from a functional point of view. To address this, there has been a lot of interest in using lower precision for deep learning, in an attempt to identify the minimum precision required to ensure functional correctness within acceptable thresholds.

Many researchers have proposed low-precision alternatives for deep learning tasks. Vanhoucke et al. [12] showed that, using 8-bit fixed-point arithmetic, convolution networks can be sped up by up to 10x on speech recognition tasks on general-purpose CPU hardware. Gupta et al. [4] successfully trained networks using 16-bit fixed point on custom hardware. Miyashita et al. [8] used logarithmic quantization on pre-trained models and achieved good accuracy by tuning the bit length for each layer. More recently, Venkatesh et al. [13] achieved near state-of-the-art results using 32-bit activations with 2-bit ternary weights on the ImageNet dataset. Hubara et al. [5] demonstrated that training from scratch with binary weights can achieve near state-of-the-art results on the ILSVRC 2012 image classification task [9].

3 Low Precision Inference

In this paper, we primarily focus on improving the performance and accuracy of the inference task. We explore the possibility of achieving high accuracy using sub-8-bit precision on state-of-the-art networks without expensive retraining. Previous work from Miyashita et al. [8] showed that by compressing the dynamic range of the input it is possible to minimize the quantization loss and achieve high accuracy. We take a different approach to minimize the impact of dynamic range on quantization. We propose a cluster-based quantization method that groups weights into smaller clusters and quantizes each cluster with a unique scaling factor. We use static clustering to group filters that accumulate to the same output feature, which simplifies the convolution operations. Empirical evidence also suggests that these clusters, which learn similar features, tend to have a smaller dynamic range. Using a dynamic fixed point representation, this method can effectively minimize the quantization errors and improve the inference accuracy of quantized networks. Applying this scheme to a pre-trained ResNet-101 model, with 4-bit weights and 8-bit activations, we achieve 76.3% TOP-1 accuracy on the ImageNet dataset [3] without any retraining.
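To make the per-cluster scheme concrete, here is a minimal NumPy sketch of quantizing one convolution layer's weights with a shared scaling factor per cluster, where each cluster is a group of N input-channel slices of one output filter so that it accumulates into the same output feature. The max-based symmetric scale, the function names, and the assumption that the channel count divides evenly by the cluster size are ours for illustration; they are not the authors' exact implementation.

```python
import numpy as np

def quantize_cluster(w_cluster, bits=4):
    """Quantize one cluster of weights to signed integers with a single
    shared scaling factor (simple max-based symmetric scale; a sketch)."""
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit signed values
    scale = max(np.max(np.abs(w_cluster)) / qmax, 1e-12)
    q = np.clip(np.round(w_cluster / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantize_conv_weights(w, cluster_size=4, bits=4):
    """w has shape [out_filters, in_channels, K, K]. Each cluster covers
    `cluster_size` input-channel slices of one output filter, so every
    cluster accumulates into the same output feature and shares one
    scaling factor (one 8-bit multiply per N*K*K products).
    Assumes in_channels is divisible by cluster_size."""
    out_f, in_c, _, _ = w.shape
    q = np.empty(w.shape, dtype=np.int8)
    scales = np.empty((out_f, in_c // cluster_size), dtype=np.float32)
    for f in range(out_f):
        for g in range(in_c // cluster_size):
            sl = slice(g * cluster_size, (g + 1) * cluster_size)
            q[f, sl], scales[f, g] = quantize_cluster(w[f, sl], bits)
    return q, scales

# Toy example: 8 output filters, 16 input channels, 3x3 kernels, N=4.
w = np.random.randn(8, 16, 3, 3).astype(np.float32)
q, scales = quantize_conv_weights(w, cluster_size=4, bits=4)
```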
3.1 2-bit Ternary Weights

Going below 4 bits, we use the ternary representation for weights, following the threshold-based approximation proposed by Li et al. [7]: we approximate the full-precision weight W ≈ αŴ in the ℓ2 sense, i.e., minimizing ‖W − αŴ‖_F, where α is a scaling factor, W is the matrix of learned full-precision weights, and Ŵ is the corresponding ternary weight with Ŵ_i ∈ {−1, 0, +1}.

Algorithm 1: Ternarize Weights
1: Input: learned full-precision weights W of a layer with d filters.
2: Group the filters into k clusters {G_j}, j = 1, ..., k. Let N = |G_j| (the number of filters in G_j).
3: For each cluster G_j:
4:   Run Algorithm 2 on each filter W ∈ G_j, and store the thresholds as a vector α.
5:   For t = 1, ..., N, let T_t = {i : α_i belongs to the top t elements of the sorted α}.
6:   Set α_t = √( Σ_{i∈T_t} α_i² / |T_t| ).
7:   Construct Ŵ(t) such that Ŵ_i(t) = Sign(W_i) if |W_i| > α_t, and 0 otherwise.
8:   Find α_t* and Ŵ(t*) that minimize Σ_{W∈G_j} ‖W − α_t Ŵ(t)‖²_F.
9:   Let α̂_t* be a reduced-precision representation of α_t*.
10: Output: the k scaling factors α̂_t* and the group of ternary weights Ŵ.

Algorithm 2: Threshold Selection
1: Input: W ∈ R^n.
2: Sort the elements of W by magnitude.
3: For τ ∈ [0, 1], let I_τ = {i : |W_i| belongs to the top ⌊τ·n⌋ elements of the sorted list}.
4: Construct Ŵ(τ) such that Ŵ_i(τ) = Sign(W_i) for i ∈ I_τ, and 0 otherwise.
5: Set α_τ = √( Σ_{i∈I_τ} W_i² / |I_τ| ).
6: Compute α_τ* that minimizes ‖W − α_τ Ŵ(τ)‖²_F over τ ∈ [0, 1].
7: Output: α_τ*.

We apply the block-quantization method described in Section 3 to compute multiple scaling factors for each layer and minimize the accuracy loss. Our method differs from [7] in the approximation used to compute the scaling factor α: we use the RMS formulation shown in equation (1). The intuition behind the RMS term is to push the threshold parameter towards larger values within the cluster, which helps speed up weight pruning.

    α = √( Σ_{i∈I_τ} W_i² / |I_τ| ),  where |I_τ| is the number of elements in I_τ.    (1)

In addition, we run our search (Algorithm 1) in a hierarchical fashion, minimizing the error within each filter first and then within the cluster of filters. Experimental evidence shows that these improvements help find a scaling factor that minimizes the quantization loss.

Using multiple scaling factors can lead to more 8-bit multiplications. Hence, we choose the cluster size carefully to improve the ratio of low-precision (2-bit) to high-precision (8-bit) operations (Section 3.3). Our algorithm (Algorithm 1) takes the full-precision learned weights and returns a ternary representation of each group of kernels along with its scaling factor. We further quantize the scaling factors down to 8 bits to eliminate any operation that requires more than 8 bits. Applying this scheme to a pre-trained ResNet-101 model with 8-bit activations, we achieve 71.8% TOP-1 accuracy on the ImageNet dataset.
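The following is a minimal NumPy sketch of the threshold search in Algorithm 2 and the per-cluster selection in Algorithm 1 (steps 4-8). It scans a small grid of sparsity levels τ rather than every prefix of the sorted list, and the grid, function names, and toy shapes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def select_threshold(w, taus=np.linspace(0.05, 1.0, 20)):
    """Algorithm 2 (sketch): find the RMS scaling factor alpha that best
    approximates w by alpha * sign(w) restricted to its largest-magnitude
    entries, and return that alpha as the filter's threshold."""
    w = w.ravel()
    order = np.argsort(-np.abs(w))                 # indices sorted by magnitude
    best_alpha, best_err = 0.0, np.inf
    for tau in taus:
        k = max(1, int(np.floor(tau * w.size)))
        idx = order[:k]                            # I_tau: top-k magnitudes
        alpha = np.sqrt(np.mean(w[idx] ** 2))      # RMS scale, eq. (1)
        w_hat = np.zeros_like(w)
        w_hat[idx] = np.sign(w[idx])
        err = np.sum((w - alpha * w_hat) ** 2)     # squared Frobenius error
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha

def ternarize_cluster(filters):
    """Algorithm 1, steps 4-8 (sketch): choose one shared threshold/scale
    for a cluster of filters from the per-filter thresholds."""
    per_filter = np.array([select_threshold(f) for f in filters])  # step 4
    order = np.argsort(-per_filter)
    best = None
    for t in range(1, len(filters) + 1):
        alpha_t = np.sqrt(np.mean(per_filter[order[:t]] ** 2))     # step 6
        w_hat = np.where(np.abs(filters) > alpha_t,
                         np.sign(filters), 0.0)                    # step 7
        err = np.sum((filters - alpha_t * w_hat) ** 2)             # step 8
        if best is None or err < best[0]:
            best = (err, alpha_t, w_hat.astype(np.int8))
    _, alpha, w_ternary = best
    return alpha, w_ternary

# Toy cluster of N=4 filters (each 16x3x3).
cluster = np.random.randn(4, 16, 3, 3).astype(np.float32)
alpha, w_t = ternarize_cluster(cluster)
```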
3.2 C1 and BatchNorm Layers

In our experiments we keep the weights of the first convolution layer at 8 bits to prevent accumulating losses, while the rest of the layers, including the fully connected layers, operate at lower precision. We also recompute the batch norm parameters during the inference phase to compensate for the shift in variance introduced by quantization. This is essential when we are not retraining at lower precision. We are exploring the possibility of fusing the batch normalization layers with the convolution layers before quantization to avoid this extra computation.

Figure 1: ResNet-101 results on the ImageNet dataset using 8-bit activations with 4-bit weights (8a-4w) and 2-bit weights (8a-2w). (Figure omitted.)

3.3 Performance Implications

Choosing the right cluster size is a trade-off between performance and accuracy: while having one cluster per layer favors higher compute density by eliminating all multiplications, it is not ideal for achieving high accuracy. Previous research in this space [13] showed that it is possible to recover some of the lost accuracy through retraining, but this is not always practical because of the costs involved in retraining these networks at low precision, not to mention the technical difficulties in reaching a reasonable solution on these networks.

We explored the accuracy-performance trade-off with various cluster sizes. Our experiments show that on ResNet-101, using a cluster size of N=4, we can achieve 71.8% TOP-1 accuracy, within 6% of the full-precision result. This result is significant because, to the best of our knowledge, it is the highest accuracy achieved on the ImageNet dataset [3] without retraining the network at low precision. In terms of performance impact, the clustering results in one 8-bit multiplication for the entire cluster (N*K^2) of ternary accumulations. Assuming roughly 50% of the convolutions are 3x3 and the rest are 1x1, using this block size can potentially replace 85% of the multiplications in the ResNet-101 convolution layers with simple 8-bit accumulations. For networks that predominantly use filters that are 3x3 or bigger, this ratio would be greater than 95%. Extending the analysis to larger clusters, we conclude that with a cluster size of N=64 we can replace ≈98% of the multiplications in ResNet-101 with 8-bit accumulations, but with a significant loss in accuracy. At that point, retraining the network at lower precision becomes necessary.

4 Training with Low-precision

We trained the low-precision ResNet-50 on the ImageNet dataset using 2-bit weights and 8-bit activations, initializing the network with a pre-trained full-precision model. We take the approach proposed by Simon et al. [10] and replace data pre-processing steps such as mean subtraction and jittering with a batch normalization layer inserted right after the data layer. We obtained the pre-trained models published by Simon et al. [10] and fine-tune the parameters of our low-precision network. In the forward pass, the weights are converted to 2-bit ternary values using Algorithm 1 in all convolution layers except the first, where the weights are quantized to an 8-bit fixed point representation. Activations are quantized to 8-bit fixed point in all layers, including the ReLU and BatchNorm layers. We did not quantize the weights of the FC layer for this training exercise. Gradient updates are performed in full precision for the convolution and FC layers. We reduced the learning rate to the order of 1e-4 to avoid the exploding gradients problem, while keeping all other hyperparameters the same as for full-precision training. After running for 4 epochs, we recovered most of the accuracy and achieved 68.6% Top-1 and 88.7% Top-5 accuracy, compared to our baseline of 75.02% (Top-1) and 92.2% (Top-5).

Figure 2: Fine-tuning ResNet-50 with pre-initialized weights on the ImageNet dataset. (Figure omitted.)
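Below is a minimal sketch of one fine-tuning step consistent with the description above: quantize the weights for the forward pass while keeping and updating a full-precision master copy. The straight-through gradient (applying the gradient computed with the quantized weights directly to the master copy) is an assumption on our part; the paper only states that gradient updates are performed in full precision. The single linear layer, squared loss, and fixed-sparsity ternarizer are toy simplifications, not the paper's setup.

```python
import numpy as np

def ternarize(w):
    """Simplified ternarization for this sketch: RMS scale over roughly the
    largest 30% of magnitudes (a fixed sparsity, unlike the search in
    Algorithms 1-2, purely to keep the example short)."""
    thr = np.quantile(np.abs(w), 0.7)
    mask = np.abs(w) > thr
    alpha = np.sqrt(np.mean(w[mask] ** 2)) if mask.any() else 0.0
    return alpha, np.where(mask, np.sign(w), 0.0)

# Toy "layer": y = x @ W, mean squared error against targets t.
rng = np.random.default_rng(0)
W_master = rng.standard_normal((64, 10)).astype(np.float32) * 0.1  # full-precision copy
x = rng.standard_normal((32, 64)).astype(np.float32)
t = rng.standard_normal((32, 10)).astype(np.float32)
lr = 1e-4                                   # reduced learning rate, as in Section 4

for step in range(10):
    alpha, W_hat = ternarize(W_master)      # forward pass uses quantized weights
    W_q = alpha * W_hat
    y = x @ W_q
    grad_y = 2.0 * (y - t) / x.shape[0]     # d(loss)/dy for mean squared error
    grad_W = x.T @ grad_y                   # gradient w.r.t. the quantized weights
    W_master -= lr * grad_W                 # straight-through: update the master copy
```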
5 Conclusion

We propose a clustering-based quantization method which exploits local correlations in the dynamic range of the parameters to minimize the impact of quantization on overall accuracy. We demonstrate near state-of-the-art accuracy on the ImageNet dataset using pre-trained models with quantized networks, without any low-precision training. On ResNet-101 using 8-bit activations, the error relative to the best published full-precision (FP32) result is within ≈6% for ternary weights and within ≈2% for 4-bit weights. To the best of our knowledge this is the best accuracy achieved with ternary weights on the ImageNet dataset.

Our clustering-based approach allows solutions to be tailored to specific hardware, based on accuracy and performance requirements. Smaller cluster sizes achieve the best accuracy: with N=4, ≈85% of the computations are low-precision operations (simple 8-bit accumulations), which is better suited to implementation on specialized hardware. Larger cluster sizes are more suited to current general-purpose hardware, with a larger portion of the computations as low-precision operations (>98% for N=64); however, this comes at the cost of reduced accuracy. This gap can be bridged with additional low-precision training, as shown in Section 4, and work is underway to further improve this accuracy. Our final quantized model can be run efficiently on a full 8-bit compute pipeline, thus offering a potential 16x performance-power benefit.

As a continuation of this work, we are also looking into a more theoretical exploration to better understand the formal relationship between clustering and final accuracy, with an attempt to establish realistic bounds for a given network-performance-accuracy requirement.

References

[1] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016.
[2] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248-255. IEEE, 2009.
[4] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, pages 1737-1746, 2015.
[5] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107-4115, 2016.
[6] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
[7] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
[8] Daisuke Miyashita, Edward H. Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
[9] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[10] Marcel Simon, Erik Rodner, and Joachim Denzler. ImageNet pre-trained models with batch normalization. arXiv preprint arXiv:1612.01452v2, 2016.
[11] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. FINN: A framework for fast, scalable binarized neural network inference. arXiv preprint arXiv:1612.07119, 2016.
[12] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011.
[13] Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.
[14] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[15] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
