HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA 1 6Very Efficient Training of Convolutional 1 0 Neural Networks using Fast Fourier 2 nTransform and Overlap-and-Add a J 5TylerHighlander∗ ∗CollegeofEng. andComp. Science 2 [email protected] WrightStateUniversity Dayton,Ohio,USA ] EAndresRodriguez∗,(cid:63) (cid:63)AirForceResearchLaboratory N [email protected] Dayton,Ohio,USA . s c [ Abstract 1 v Convolutionalneuralnetworks(CNNs)arecurrentlystate-of-the-artforvariousclas- 5 sification tasks, but are computationally expensive. Propagating through the convolu- 1 tionallayersisveryslow,aseachkernelineachlayermustsequentiallycalculatemany 8 dotproductsforasingleforwardandbackwardpropagationwhichequatestoO(N2n2) 6 0 perkernelperlayerwheretheinputsareN×N arraysandthekernelsaren×narrays. . ConvolutioncanbeefficientlyperformedasaHadamardproductinthefrequencydo- 1 main. ThebottleneckisthetransformationwhichhasacostofO(N2log N)usingthe 0 2 fastFouriertransform(FFT).However,theincreaseinefficiencyislesssignificantwhen 6 N(cid:29)nasisthecaseinCNNs. Wemitigatethisbyusingthe“overlap-and-add”tech- 1 niquereducingthecomputationalcomplexitytoO(N2log n)perkernel. Thismethod : 2 v increasesthealgorithm’sefficiencyinboththeforwardandbackwardpropagation, re- i X ducingthetrainingandtestingtimeforCNNs. Ourempiricalresultsshowourmethod reduces computational time by a factor of up to 16.3 times the traditional convolution r a implementationfora8×8kernelanda224×224image. 1 Introduction Convolutionalneuralnetworks(CNNs)achievedstate-of-the-artclassificationratesonvari- ousdatasets[1,4,7],butrequiresignificantcomputationalresources. Forexample,AlexNet [5] has over 60 million free parameters trained with stochastic gradient descent requiring thousands of forward and backward propagations through a network with 5 convolutional layers. A more recent CNN, GoogLeNet [12], has various layers within layers amounting to59convolutionallayers. Propagatingthroughtheseconvolutionallayersisthecomputa- tional bottleneck of training and testing CNNs. Standard convolutional layers are slow, as eachkernelmustcalculatemanydotproductsforasingleforwardandbackwardpropagation whichequatestoO(N2n2)perkernel,wheretheinputsareN×N arraysandthekernelsare n×n arrays. To converge to a local minimum, CNNs usually require hundreds of epochs. Anepochconsistsofpropagatingallthetrainingsamplesinthedatasetthroughthenetwork once. Inaddition,itiscommontotrainmultipleCNNsforonetaskandcomputeanaverage (cid:13)c 2015.Thecopyrightofthisdocumentresideswithitsauthors. Itmaybedistributedunchangedfreelyinprintorelectronicforms. 2 HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA of the multiple outputs in testing. With over one million training images in the ImageNet LargeScaleVisualRecognitionChallenge(ILSVRC)[1]andhundredsofepochsneededfor training each CNN, reducing the complexity of the convolution operation reduces training time. ConvolutioncanbeefficientlycomputedinthefrequencydomainasaHadamardprod- uct. The computational bottleneck is the Fourier transform between the space and the fre- quencydomain. ForaninputofsizeN2,thiscanbeefficientlycomputedusingfastFourier transforms(FFTs)withcomplexityO(N2log N). Mathieuetal.[8]demonstratedthatthis 2 reducesthetrainingandtestingtimeofCNNs. IntheirworktheyefficientlycalculatedFFTs onaGPUandusedthesetransformstoperformconvolutionsviaaHadamardproductinthe frequencydomain. In this paper, we propose to use the overlap-and-add (OaA) technique [10] to further reducethetrainingandtestingcomplexitytoO(N2log n)perkernel. Notethattheoverlap- 2 and-save [10] is a similar technique that may be marginally faster but has the same com- plexity. In a CNN convolutional layer, the input array and the set of K kernel arrays have a depth of sizeC, e.g.,C=3 for an RBG input image. The total number of convolutions inaCNNconvolutionallayerisKC; eachchannelofeachkernelisconvolvedwiththere- spectivechannelintheinputarray. Foreachoftheseconvolutions, OaAshouldbeusedto improveefficiencywithnocostinperformance. Section2explainstheOaAtechniqueand ourconvolutionimplementation. InSection3wedemonstratethatourmethodcomputation- allyoutperforms(byafactorofupto16.3)traditionalimplementations.Weofferconcluding remarksinSection4. 2 Overlap-and-Add InOaA,theinputisbrokenintoN2/n2(roundedup)blocksequaltothekernelsizen×n. A convolution between each block and the kernel is computed and the results are overlapped andadded. Figure1illustratesasimple1-Doverlap-and-addmethodforspacialconvolution (thatcaneasilygeneralizeto2-D).Theinputarrayisfirstsplitintosmallerblocksthatare thesizeofthekernel. Smallerconvolutionsarecomputedbetweenthekernelandtheblock inputs. The resulting convolutions are overlapped by n−1, where n is the length of the kernel,andaddedtogethertocreatethesameresultsasatraditionalspacialconvolution. Figure1: 1-DOverlap-and-Addconvolutions HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA 3 Each convolution in OaA can be efficiently computed in the frequency domain, where thebottleneckisthecomplexityofeach2-DfastFouriertransformO(n2log n). Thetotal 2 complexityfortheentireinputandkernelisthenumberofblockstimesthecomplexityof eachblockconvolution,i.e.,O(N2log n).Table1comparesthecomplexityofeachmethod, 2 wherespaceConvreferstothetraditionalconvolutioninthespacedomain,FFTconvrefers to convolution via a Hadamard product in the frequency domain without overlap-and-add, andOaAconvreferstoconvolutionusingoverlap-and-savewhereeachsmallerconvolution isefficientlycomputedinthefrequencydomain. Method ComputationalComplexity spaceConv O(N2n2) FFTconv O(N2log N) 2 OaAconv O(N2log n) 2 Table1: ComputationalComplexityComparison WhenN(cid:29)nasisthecommoncaseinCNNarchitectures,OaAconvreducesthecompu- tationalcomplexitybyafactorof n2 overspaceConvandbyafactorof logN overFFTconv. log2n logn For example, for a 256×256 input array with a 5×5 kernel (typical values in a CNN ar- chitecture), spaceConv has a complexity of O(2562×25), FFTconv has a complexity of O(2562×8),andOaAconvhasacomplexityofO(2562×2.3). The overall time complexity of OaAconv can be further reduced by noting that all the block convolutions can be computed in parallel. If N2/n2 threads are available (a fair as- sumptionformodernGPUs),thecomplexityoneachthreadisn2log n,andtheoveralltime 2 complexityisO(max(N2,n2log n))whichisusuallyO(N2). Wecangetadditionalspeed 2 upbytakingadvantageoftheNVIDIACUDAFastFourierTransformlibrary(cuFFT)that computeseachindividualFFTsupto10timesfaster[9](seealso[13]). However,inorder tohaveafaircomparison,ourexperimentsinthispaperarerunonsinglethreads. Aspartof thiswork,wecreatedaCaffe[3]fork1thatusesamulti-threadGPUimplementationofOaA forefficientconvolutions. ApossibleareaofconcernforOaAistheadditionalcostofbreakingtheinputintoblocks andtheoverlappingandaddingcostaftereachconvolution. Ourexperimentsshowthatthe performanceincreaseoftheOaAtechniqueoutweightheseoverheadcosts. It is worth noting that although OaA always reduces the computational complexity in testing (and training), it is particularly beneficial in implementations such as Sermanet et al. [11] when the test image is much larger than the training images, and uses a sliding window approach across a pyramid of scales for simultaneous detection (localization) and classification. 3 Experiments and Results 3.1 Trainingconsistency InthisexperimentweuseeachmethodofconvolutionstotraintheCNNLeNet-5[6]archi- tectureusingtheMNIST[7]dataset. Thegoalofthisexperimentistoshowempiricallythat themethodsareequivalent. Eachnetworkistrainedforonly100epochsasallweneedto 1github.com/THighlander 4 HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA showisconsistency. EachCNNnetworkistrainedfivetimeswitheachtypeofconvolution technique,andtheirclassificationrateaveragesareshowninTable2. Convolutionmethod Performancerate(averagedover5networks) spaceConv 92.48% FFTconv 92.41% OaAconv 92.46% Table2: Performanceratesfornetworkstrainedwithvariousconvolutionmethods Asexpected,allthreemethodsaveragedwithin0.07%ofeachother. Thereasonthereis anon-zerodifferenceistherandomparameterinitializationintrainingeachCNN. 3.2 Timevs. numberofkernels In this experiment we compare the required total propagation time through one convolu- tionallayerasthenumberofkernelsinthelayerincreases. Tocomparecomputationaltime, note that additional channels in the CNN convolutional layer input array can be treated as amultiplicativefactorinthenumberofkernels,i.e.,thecomputationsrequiredtoconvolve aC-channels input array with a set of K-channels kernels is equivalent to convolving a 1- channelinputarraywithasetof1-channelKCkernels. Inourexperiments,theinputarray is of size 32 × 32 and each kernel is of size 5 × 5, both with 1-channel. Figure 2 shows thespeed-upfactorofFFTconvandOaAconvcomparedtospaceConvintheforwardprop- agation of the convolutional layer as the number of kernels varies. The number of kernels isvariedfrom25to750withadiscretestepof25. Each“numberofkernels”experimentis repeated10timesandtheresultsareaveraged. Figure2: Speed-upoverspaceConvvs. numberofkernelsinforwardpropagation HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA 5 FFTconv and OaAconv outperform spaceConv, and OaAconv outperforms FFTconv at everystep. FFTconvandOaAconvhaveanadditionalinitializationcost: inOaAtheinput array must be divided, and in both methods the input (or divided input) array and kernel mustbezero-padded,sothateachsideisN+n−1inFFTconvand2n−1inOaAconv,prior to computing the Fourier transforms. As the number of kernels increases, these additional initializationcostsbecomeslesssignificant. ToquantifytheinitializationcostofFFTconvandOaAconv,weconvolvedaninputarray of size 224 × 224 with a kernel of size 8 × 8. This experiment is repeated 10 times and averaged. OaAconv spends 1.6% of its total calculations on handling the overhead, while FFTconv spends 8.2%. In our implementation spaceConv does not require any overhead. ThelargedifferencebetweentheinputsizeandkernelsizecausesFFTconvtohavealarger overheadthanOaAconv. Figure 3 shows the speed-up over spaceConv vs. number of kernels for the backward propagation. OurmethodoutperformsFFTconvmorethanintheforwardpropagation. The reasonforthisisthatthebackwardpropagationcontainstwoactualconvolutionsperkernel: oneconvolutiontopropagatetheerrorthroughthelayerandanothertocalculatethechange inweightgeneratedbythiserror. Figure3: Speed-upoverspaceConvvs. numberofkernelsinbackwardpropagation These experiments show that the additional initialization costs of using OaAconv and FFTconvaremitigatedbythelowercomplexityofthesemethods. Themorekernelsused, thelargertheperformanceincreaseofourmethod. 3.3 Timevs. kernelsize Inthisexperimentwevarythesizeofthekernelwhilekeepingtheinputsizeconstant. The size of the input array is 64 × 64. The number of kernels used is held constant at 100. Figures 4 and 5 show the speed-up over spaceConv vs. kernel size for the forward and backwardpropagation,respectively. Thekernelsizesvaryfrom1to64withadiscretestep of1. Each“kernelsize”experimentisrepeated10timesandtheresultsareaveraged. 6 HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA Figure4: Speed-upoverspaceConvvs. kernelsizeinforwardpropagation Figure5: Speed-upoverspaceConvvs. kernelsizeinbackwardpropagation HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA 7 IntheforwardpropagationinFigure4,theperformanceofFFTconvandOaAconvcon- verge at 64 as expected. An interesting aspect of this graph is the various performance peaksatdifferentkernelsizes. ThisisbecausetheFFTsoftware[2]isoptimalforFourier transforms with power of 2 sides along each dimension. We can leverage this fact when designing CNN architectures to further reduce computational requirements, and/or we can zero-padtothenearestpowerof2tocomplementthedesignoftheFFTalgorithm. Notethat the backward propagation has peak performances at different kernel sizes due to different zero-paddingsizespriortotheFouriertransform. 3.4 Timevs. inputsize In this experiment we test the performance of multiple input sizes while holding the ker- nel size constant at 5 × 5. The input sizes varied from 4×4 to 256×256 with a discrete step of 4×4 for the forward propagation experiment and a discrete step of 8×8 for the backward propagation experiment. Different discrete steps are used to highlight the FFT algorithm’sbestperformingsizetransformsforeachpropagation. Each“inputsize”exper- imentisrepeated10timesandtheresultsareaveraged. Figures6and7showthespeed-up overspaceConvvs. inputsizefortheforwardandbackwardpropagation,respectively. Figure6: Speed-upoverspaceConvvs. inputsizeinforwardpropagation Forinputslargerthan8×8inputarraysizes(thetypicalscenarioforCNNarchitectures), OaAconvalwaysoutperformstheothermethods. Thespeed-upinthebackwardspropaga- tionismoresignificant. TheerrorpropagationisenhancedwithOaAconvasthekernelsare small. 8 HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA Figure7: Speed-upoverspaceConvvs. inputsizeinbackwardpropagation . 4 Conclusion InthispaperwedemonstratedthatOaAconvcanimprovetheefficiencyofCNNsovertradi- tionalconvolutionandconvolutionviaaHadamardproductinthefrequencydomain. Even thoughOaAconvmustcalculatemoretransformsthanFFTconv,thefactthatthetransforms aresmalleroutweighsthecostofhavingtocalculatemore. Inordertohavefaircomparisons we conducted our experiments on single threads. In practice OaA should be implemented suchthateachblockconvolutioniscomputedinparallel,andeachFFTisimplementedwith aGPUFFTlibrary,e.g.,cuFFT[9]toachievemaximumperformance. InfutureworkweplantooptimizetheFFTimplementationsforsmallsizetransforms, anddesignaCNNentirelyinthefrequencydomaineliminatingthebottleneckofthetrans- forms. BycreatingaCNNthatisinthefrequencydomain, theconvolutionallayerswould onlybetheHadamardproducts.Thecurrentchallengetothisistoefficientlymapanapprox- imationofthenon-lineartransformsofCNNs(e.g.,rectifiedlinearunits,sigmoidfunction, and/orhyperbolictangent)tothefrequencydomain. Acknowledgements We would like to thank Prof. Mateen Rizki of Wright State University for the helpful dis- cussions, and theAir ForceOfficeof ScientificResearch (AFOSR)forproviding themain fundingforthisworkthroughLRIR14Y06COR.Approvedforpublicrelease,casenumber: 88ABW-2015-3481 HIGHLANDER,RODRIGUEZ:EFFICIENTTRAININGOFCNNSUSINGFFTANDOAA 9 References [1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on ComputerVisionandPatternRecognition,pages248–255,2009. [2] MatteoFrigoandStevenGJohnson. Fftw: Anadaptivesoftwarearchitectureforthe FFT. InProceedingsofIEEEInternationalConferenceonAcoustics,SpeechandSig- nalProcessing,volume3,pages1381–1384,1998. [3] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia,pages675–678,2014. [4] AlexKrizhevskyandGeoffreyHinton. Learningmultiplelayersoffeaturesfromtiny images.ComputerScienceDepartment,UniversityofToronto,Tech.Rep,1(4):7,2009. [5] AlexKrizhevsky,IlyaSutskever,andGeoffreyEHinton. Imagenetclassificationwith deep convolutional neural networks. In Advances in neural information processing systems,pages1097–1105,2012. [6] YannLeCunandYoshuaBengio.Convolutionalnetworksforimages,speech,andtime series. Thehandbookofbraintheoryandneuralnetworks,3361:310,1995. [7] YannLeCunandCorinnaCortes. Mnisthandwrittendigitdatabase. AT&TLabs[On- line].http://yann.lecun.com/exdb/mnist,2010. [8] Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networksthroughFFTs. InternationalConferenceonLearningRepresentations,2014. [9] CUDANvidia. Cufftlibrary. http://docs.nvidia.com/cuda/cufft,2010. [10] AlanVOppenheim,RonaldWSchafer,JohnRBuck,etal. Discrete-timesignalpro- cessing,volume2. Prentice-hallEnglewoodCliffs,1989. [11] PierreSermanet,DavidEigen,XiangZhang,MichaelMathieu,RobFergus,andYann LeCun. Overfeat: Integrated recognition, localization and detection using convolu- tional networks. In International Conference on Learning Representations. CBLS, April2014. [12] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,DumitruErhan,VincentVanhoucke,andAndrewRabinovich.Goingdeeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and PatternRecognition,pages1–9,2015. [13] NicolasVasilache,JeffJohnson,MichaelMathieu,SoumithChintala,SerkanPiantino, and Yann LeCun. Fast convolutional nets with fbfft: A gpu performance evaluation. InternationalConferenceonLearningRepresentations,2015.