On Vectorization of Deep Convolutional Neural Networks for Vision Tasks

Jimmy SJ. Ren and Li Xu
Lenovo Research & Technology
http://vcnn.deeplearning.cc
[email protected], [email protected]

arXiv:1501.07338v1 [cs.CV] 29 Jan 2015

Abstract

We recently have witnessed many ground-breaking results in machine learning and computer vision, generated by using deep convolutional neural networks (CNN). While the success mainly stems from the large volume of training data and the deep network architectures, the vector processing hardware (e.g. GPU) undisputedly plays a vital role in modern CNN implementations to support massive computation. Though much attention was paid in the extant literature to understanding the algorithmic side of deep CNN, little research was dedicated to the vectorization for scaling up CNNs. In this paper, we study the vectorization process of the key building blocks in deep CNNs, in order to better understand and facilitate parallel implementation. Key steps in training and testing deep CNNs are abstracted as matrix and vector operators, upon which parallelism can be easily achieved. We developed and compared six implementations with various degrees of vectorization, with which we illustrate the impact of vectorization on the speed of model training and testing. Besides, a unified CNN framework for both high-level and low-level vision tasks is provided, along with a vectorized Matlab implementation with state-of-the-art speed performance.

Introduction

Deep convolutional neural network (CNN) has become a key tool in addressing large scale artificial intelligence tasks. Though the study of CNN can be traced back to the late 1980s (LeCun et al. 1989; LeCun et al. 1990), the recent success of deep CNN is largely attributed to the concurrent progress of two technical streams. On the one hand, new deep CNN architectures with elements such as Dropout (Hinton et al. 2012; Krizhevsky, Sutskever, and Hinton 2012), DropConnect (Wan et al. 2013) and Rectified Linear Units (ReLU) (Nair and Hinton 2010), as well as new optimization strategies (Dean et al. 2012), have empowered deep CNN with greater learning capacity. On the other hand, the rapid advances and democratization of high performance general purpose vector processing hardware, typified by the graphics processing unit (GPU), unleash the potential power of deep CNN by scaling up the network significantly.

Various infrastructures have been used in scaling up deep CNNs, including GPUs (Coates et al. 2013), distributed CPU based frameworks (Dean et al. 2012), FPGAs (Farabet et al. 2009), etc. Though the implementation details among those approaches differ, the core insight underlying the idea of scaling up deep CNN is parallelization (Bengio and LeCun 2007), in which vectorization is the fundamental element. While the consecutive distinguished performance of GPU-trained CNNs in the ImageNet visual recognition challenge (Krizhevsky, Sutskever, and Hinton 2012; Russakovsky et al. 2013), as well as the results reported in many studies in the literature, justifies its effectiveness (Jia et al. 2014; Sermanet et al. 2013), the published literature did not provide sufficient insight into how the vectorization was carried out in detail. We also found no previous study answering how different degrees of vectorization influence the performance of deep CNN, which is, however, crucial in finding the bottlenecks and helps to scale up the network architecture. We believe these questions form a significant research gap, and the answers shall shed some light on the design, tuning and implementation of vectorized CNNs.

In this paper, we reinterpret the key operators in deep CNNs in vectorized forms with which high parallelism can be easily achieved given basic parallelized matrix-vector operators. To show the impact of vectorization on the speed of both model training and testing, we developed and compared six implementations of CNNs with various degrees of vectorization. We also provide a unified framework for both high-level and low-level vision applications including recognition, detection, denoising and image deconvolution. Our Matlab Vectorized CNN implementation (VCNN) will be made publicly available on the project webpage.
Related Work

Efforts on speeding up CNN by vectorization started with its inception. A specialized CNN chip (Jackel et al. 1990) was built and successfully applied to handwriting recognition in the early 90s. Simard et al. (2003) simplified CNN by fusing the convolution and pooling operations. This sped up the network and performed well in document analysis. Chellapilla et al. (2006) adopted the same architecture but unrolled the convolution operation into a matrix-matrix product. It has now been proven that this vectorization approach works particularly well with modern GPUs. However, limited by the available computing power, the scale of the CNNs explored at that time was much smaller than modern deep CNNs.

When deep architectures showed their ability to effectively learn highly complex functions (Hinton, Osindero, and Teh 2006), scaling up neural network based models soon became one of the major tasks in deep learning (Bengio and LeCun 2007). Vectorization played an important role in achieving this goal. Scaling up CNN by vectorized GPU implementations such as Caffe (Jia et al. 2014), Overfeat (Sermanet et al. 2013), CudaConvnet (Krizhevsky, Sutskever, and Hinton 2012) and Theano (Bergstra et al. 2010) generates state-of-the-art results on many vision tasks. Albeit the good performance, few of the previous papers elaborated on their vectorization strategies. As a consequence, how vectorization affects design choices in both model training and testing remains unclear. Efforts were also put into accelerating parts of the deep CNN from the algorithmic side, exemplified by separable kernels for convolution (Denton et al. 2014) and the FFT speedup (Mathieu, Henaff, and LeCun 2013).
Instead of finding a faster alternative for one specific layer, we focus more on the general vectorization techniques used in all building blocks of deep CNNs, which is instrumental not only in accelerating existing networks, but also in providing guidance for implementing and designing new CNNs across different platforms, for various vision tasks.

Vectorization of Deep CNN

Vectorization refers to the process that transforms the original data structure into a vector representation so that the scalar operators can be converted into a vector implementation. In this section, we introduce vectorization strategies for the different layers in deep CNNs.

Figure 1 shows the architecture of a typical deep CNN for vision tasks. It contains all of the essential parts of modern CNNs. Comprehensive introductions to CNN's general architecture and the recent advances can be found in (LeCun et al. 1998) and (Krizhevsky, Sutskever, and Hinton 2012).

[Figure 1: Convolutional Neural Network architecture for visual recognition; stages: convolution (a), pooling (b), convolution (c), pooling, fully connected (d), with mini-batch handling marked as (e).]

We mark the places where vectorization plays an important role. "a" is the convolution layer that transforms the input image into feature representations, whereas "b" is the one that handles the pooling related operations. "c" represents the convolution related operations for feature maps. We will see shortly that the vectorization strategies for "a" and "c" are slightly different. "d" involves operations in the fully connected network. Finally, "e" is the vectorization operation required to simultaneously process multiple input samples (e.g. mini-batch training). It is worth noting that we need to consider both the forward pass and back-propagation for all these operations.

Vectorizing Convolution

We refer to the image and intermediate feature maps as f and to one of the convolution kernels as w_i; the convolution layer can then typically be expressed as

    f_i^{l+1} = σ(w_i^l ∗ f^l + b_i^l),    (1)

where i indexes the i-th kernel and l indexes the layer. b_i^l is the bias weight and ∗ is the convolution operator. For vision tasks, f can be 2- or 3-dimensional. The outputs from the previous layer can be deemed as one single input f^l. σ is the nonlinear function, which could be ReLU, hyperbolic tangent, sigmoid, etc. Adding the bias weight and applying the nonlinear mapping are element-wise operations which can be deemed as already fully vectorized, i.e. the whole feature vector can be processed simultaneously. Contrarily, the convolution operators involve many multiplications with conflicting memory accesses. Even if the operators are parallelized for each pixel, the parallelism (Ragan-Kelley et al. 2013) to be exploited is rather limited: compared to the number of computing units on a GPU, the number of convolutions in one layer is usually smaller. A fine-grained parallelism on element-wise multiplication is much preferred, leading the vectorization process to unroll the convolution.

In what follows, all the original data f, b and w can be viewed as data vectors. Specifically, we seek a vectorization operator ϕ_c() to map a kernel or feature map to its matrix form so that convolution can be conducted by matrix-vector multiplication. However, a straightforward kernel-matrix, image-vector product representation of convolution is not applicable here, since the kernel matrix is a sparse block-Toeplitz-Toeplitz-block one, not suitable for parallelization due to the existence of many zero elements. Thanks to the duality of kernel and feature map in convolution, we can instead construct a dense feature-map-matrix and a kernel-vector. Further, multiple kernels can be put together to form a matrix so as to generate multiple feature map outputs simultaneously,

    [f_i^{l+1}]_i = σ(ϕ_c(f^l) [w_i^l]_i + [b_i^l]_i).    (2)

The operator [ ]_i assembles the vectors indexed by i to form a matrix.

Backpropagation. The training procedure requires the backward propagation of gradients through ϕ_c(f^l). Note that ϕ_c(f^l) is in the unrolled matrix form, different from the outputs of the previous layer [f_i^l]_i. An inverse operator ϕ_c^{-1}() is thus required to transform the matrix-form gradients into the vector form for further propagation. Since ϕ_c() is a one-to-many mapping, ϕ_c^{-1}() is a many-to-one operator. Fortunately, the gradient update is a linear process which can be processed separately and combined afterwards.

Matlab Practice. Our Matlab implementation to vectorize the input image is shown in Fig. 2. Specifically, we first crop the image patches based on the kernel size and reorganize them into columns. The convolution kernels are arranged by rows in another matrix. The product of these two matrices puts all the convolved feature maps in the resulting matrix, one feature map per row. ϕ_c() here can be efficiently implemented by the im2col() function in Matlab on both GPU (we used a custom version of im2col() for GPU) and CPU. We note that the feature map vectors here are the transpose of f_i, simply because of how im2col() arranges its output.

[Figure 2: Strategy to vectorize convolution. Illustration of the way to convolve a 3x3 input image with three 2x2 kernels and generate three 2x2 feature maps.]
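To make the above practice concrete, the following is a minimal Matlab sketch (not the authors' released VCNN code) of Eq. (2) for the Fig. 2 setting: a toy 3x3 input, three 2x2 kernels, the valid region only, random illustrative weights, and ReLU assumed as the nonlinearity. im2col() is the Image Processing Toolbox function named in the text.

    % Minimal sketch of the unrolled convolution in Eq. (2) / Fig. 2.
    f = reshape(1:9, 3, 3);            % toy 3x3 input image
    w = rand(2, 2, 3);                 % three 2x2 kernels (illustrative values)
    b = rand(3, 1);                    % one bias per kernel

    P = im2col(f, [2 2], 'sliding');   % phi_c(f): each column is one 2x2 patch (4x4)
    K = reshape(w, [], 3)';            % kernel matrix, one kernel per row (3x4)

    maps = max(bsxfun(@plus, K * P, b), 0);   % Eq. (2) with ReLU as sigma; 3x4
    % Row i holds the i-th 2x2 feature map; reshape(maps(i,:), 2, 2) recovers
    % its spatial layout.

As in common CNN practice, the kernel is applied in correlation form (no flipping); the point of the sketch is only that one dense matrix product produces all feature maps at once.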
Convolution with the feature map. An alternative practice is needed to handle convolution with the feature maps (e.g. "c" in Fig. 1). This is because we need to first combine [f_i^l]_i into a higher-dimensional f^l and then perform the convolution. One example is illustrated in Fig. 3, where we need to combine three 3x3 feature maps into a 3x3x3 one and apply ϕ_c(). In practice, we could apply ϕ_c() three times to the f_i^l and then combine the results. We found this less efficient, since the actual number of feature maps is much larger. To exploit more parallelism, we instead reorganize the data and apply the vectorization operator ϕ_c() only once.

In Fig. 3, we first reshape the column vectors [f_i]_i (a) back to 2D matrices and put them side by side (b). This operation is cheap since it just changes the representation of the data but does not rearrange it. It allows us to vectorize all the feature maps in a holistic way (c). Since only the valid region is considered during convolution, we set all the redundant columns to zeros. The final ϕ_c(f) is obtained by rearranging the intermediate result in (c). Since redundant columns are involved, we use the Matlab function accumarray() to handle the many-to-one rearrangement.

One may note that the operator ϕ_c^{-1}() is a many-to-one mapping as well, so it too can be efficiently implemented by accumarray() for backpropagation.

[Figure 3: Strategy to vectorize convolution with a feature map. Illustration of the way to convolve a 3x3x3 feature map.]

Vectorizing Pooling

It is inefficient to carry out the pooling separately for each feature map, so the goal here is to process those separate operations simultaneously by vectorization. The pooling operator can be abstracted as

    f^{l+1} = σ(ϕ_p(f^l) + b^l),    (3)

where ϕ_p() is a many-to-one mapping with a defined operation corresponding to max- or average-pooling.

Due to the information loss, the inverse operator ϕ_p^{-1}() is not well defined. We simply use nearest-neighbor upscaling as an approximation during backpropagation.

The pooling operations, both average pooling and max pooling, can be thought of as a vector accumulation process guided by a pre-defined index map, which can be similarly implemented by accumarray(). The only difference for max pooling is that a max function is involved. For overlapping pooling, we could insert the overlapped elements into the feature map and then apply the same pooling strategy.

[Figure 4: Strategy to vectorize pooling. Illustration of pooling for one 4x4 feature map.]
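As a toy illustration of this index-guided accumulation (again a sketch rather than the released implementation), the following pools one 4x4 map with 2x2 non-overlapping windows using accumarray(); the feature map values and the pool layout are illustrative.

    % Minimal sketch of pooling as index-guided accumulation (Eq. (3) / Fig. 4).
    f   = magic(4);                              % toy 4x4 feature map
    idx = kron(reshape(1:4, 2, 2), ones(2));     % pre-defined index map: pool id per pixel

    avg_pooled = reshape(accumarray(idx(:), f(:), [], @mean), 2, 2);  % average pooling
    max_pooled = reshape(accumarray(idx(:), f(:), [], @max),  2, 2);  % max pooling

    % With distinct pool ids per feature map, one accumarray() call can pool
    % all the maps of a layer at once, which is the point of the vectorization.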
Vectorizing Fully Connected Layers

The fully connected layers (i.e. "d" in Fig. 1) can be written as a dense matrix-matrix multiplication. Thus both the feed-forward pass and the back-propagation are naturally vectorized in a unified matrix-matrix form.

Vectorization for Mini-batches

Our vectorization strategy can be directly extended to support mini-batch training. Given a batch of samples indexed by j, the mini-batch training with a convolution layer is given by

    [f_i^{l+1}]_i = σ([ϕ_c(f_j^l)]_j [w_i^l]_i + [b_i^l]_i),    (4)

where [ ]_j assembles the matrices of the different samples.

Figure 5 shows the Matlab implementation of batch mode for the same operation as in Fig. 2 with a batch size of 2. Both samples in the input batch are vectorized and the outputs are arranged horizontally. The product of the same kernel matrix as in Fig. 2 and this assembled matrix simultaneously generates the feature maps for both samples. Note that if an input sample of a convolutional layer has multiple channels, we can treat it as a multi-channel feature map as shown in Fig. 3.

[Figure 5: Strategy to vectorize mini-batch operations.]
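A minimal sketch of Eq. (4) under the same toy setting as before is given below; for readability the per-sample patch matrices are concatenated in a loop, whereas in practice they would be preallocated or produced by a batched patchification step.

    % Minimal sketch of mini-batch vectorization (Eq. (4) / Fig. 5): the patch
    % matrices of the individual samples are concatenated horizontally so that
    % a single matrix product evaluates the layer for the whole batch.
    batch = rand(3, 3, 2);                      % two toy 3x3 samples
    K     = reshape(rand(2, 2, 3), [], 3)';     % three 2x2 kernels, one per row

    P = [];
    for j = 1:size(batch, 3)
        P = [P, im2col(batch(:, :, j), [2 2], 'sliding')];   % [phi_c(f_j)]_j
    end

    maps = max(K * P, 0);   % 3x8: rows are kernels; columns 1:4 sample 1, 5:8 sample 2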
Experiments and Analysis

The goal of the experiments presented in this section is to understand the role of vectorization in training and testing CNNs, as well as its limitations. In order to make our experimental results of high validity and relevancy, we compared the training and testing speed of our fully vectorized implementation with Caffe and Cudaconvnet; the speed is competitive, if not faster, in all the tested cases.

Comparing Different Degrees of Vectorization

In this section, we seek to understand the role of vectorization by comparing six CNN implementations. These implementations differ in the degree of vectorization, as illustrated in Table 1. Imp-1 is the least vectorized one, Imp-2 naively parallelizes the processing of batch samples by adding a parallel for-loop to Imp-1, whilst Imp-6 is a fully vectorized implementation guided by the approaches we introduced. When working with one particular implementation, we also observe how the results change with the network scale. The reason is that we would like to examine the different vectorization strategies with both small scale and large scale networks, and we are particularly interested in large scale CNNs since they are more relevant to recent advances in the field.

Table 1: Different vectorized elements included in the six CNNs. Fu-co: fully connected layer; Conv: convolutional layer; Pool: pooling layer; Feat: feature map patchification; Batch: vectorize for batch. An X indicates the element is vectorized.

    Vec. element  Fu-co  Conv  Pool  Feat  Batch
    Imp-6           X     X     X     X     X
    Imp-5           X     X     X     X
    Imp-4           X     X     X
    Imp-3           X     X
    Imp-2           X                       X
    Imp-1           X

We consider three scales. Scale 1 is a small model; it is very similar to the standard LeNet (LeCun et al. 1998), but with more feature maps in the convolutional layers (this setting of the conv layers is the same as in Caffe's MNIST example; we have 2 fully connected hidden layers), ReLU nonlinearity and a cross-entropy error. Scale 2 is a large network with about 60 million trainable parameters, which is comparable to AlexNet (Krizhevsky, Sutskever, and Hinton 2012). However, the architecture is tailored for the purpose of this study. First, we would like to keep the number of conv layers and the number of fully connected layers balanced so that we have a fair performance breakdown. This also allows us to directly compare the results with Scale 1. Second, to enable a fair comparison, we would like to have a unified vectorization scheme for convolution throughout the network. Thus, unlike AlexNet, which uses a stride-4 convolution in the first conv layer and stride 1 thereafter, all convolution operations use the same stride of 1 in our model. The consequence is that, compared to AlexNet, Scale 2 tends to have more feature maps in the conv layers but a smaller size for the input images. We set the number of output units to 1000 as in AlexNet. Scale 3 is a larger network with 10,000 output units. This pushes the number of trainable parameters to 94 million, keeping the other settings the same as Scale 2. The performance on GPU (a GeForce GTX 780 Ti; the same card was used in the rest of the experiments), in terms of the number of images processed per second, of the six CNNs with different network scales during training is given in Table 2.

Table 2: Training performance of the six CNNs (# images processed per second). Scale 1: small model, 10 output units; Scale 2: large model, 1000 output units; Scale 3: larger model, 10,000 output units. The batch size is 100 for Scale 1 and 200 for Scales 2 and 3.

    #img/sec  Imp-1  Imp-2  Imp-3  Imp-4  Imp-5  Imp-6
    Scale 1   1      6.1    15.3   29.5   85.4   1312.1
    Scale 2   n/a    n/a    2.4    11     42.3   188.7
    Scale 3   n/a    n/a    2.3    10     34.3   161.2

We are able to observe several insights from these figures. First of all, the results indicate that vectorization is vital to CNN's speed. We can see that Imp-1 is very slow and that naive parallelization (Imp-2) works poorly on GPU. Especially for the large scale networks, Imp-1 and Imp-2 are simply too slow to be practical. When training a small network, a fully vectorized CNN (Imp-6) is more than 200 times faster than the naive parallelization version during training and more than 100 times faster during testing. This acceleration is going to be even more significant for bigger networks, since Imp-1 and Imp-2 scale poorly.

Second, all the vectorization elements we introduced contribute significantly to the final performance, during both training and testing. One interesting insight is that the contribution of vectorizing pooling and feature map patchification seems to increase with the scale of the network. For instance, in Table 2, Imp-4 (vectorized pooling) has a 1.9x speedup over Imp-3 under Scale 1, but a 4.5x and a 4.3x speedup under Scale 2 and Scale 3 respectively. The same phenomenon happens for testing. This strongly indicates that the vectorization strategy for those two elements scales well with the size of the network.

On the other hand, we also observe that vectorizing batch processing brings more than a 10x speedup for small models but only a 3x to 5x speedup for large scale models. The contribution of vectorizing batch processing to the performance seems to decrease when scaling up the network, though the speedup remains significant. We further investigate this phenomenon in the next section, which leads to a strategy to achieve optimal training and testing speed.
In Search of Optimal Speed

We investigated the puzzle of the decelerating speedup by scrutinizing the performance against different batch sizes. The results are presented in Tables 3 and 4 for training and testing respectively.

Table 3: Training performance of Imp-6 against different batch sizes (# images processed per second).

    #img/sec  b=1   b=100  b=200   b=300   b=400
    Scale 1   88.5  1312   1450.9  1574.2  1632.8
    Scale 2   41.9  136.9  188.7   192.3   106.3
    Scale 3   34.3  123.5  161.3   163.9   91

Table 4: Test performance of Imp-6 against different batch sizes (# images processed per second).

    #img/sec  b=1    b=100   b=200   b=400   b=600
    Scale 1   151.5  1812.6  1878.4  2023.5  2192.2
    Scale 2   75.8   222.2   270.2   285.7   103.1
    Scale 3   74     212.8   256.4   277.8   89.2

In Table 3, we can see that for the small model (Scale 1) the acceleration brought by each successive batch size increase is 14x, 1.1x, 1.08x and 1.03x. The acceleration obtained via an increase of the batch size thus vanishes rapidly. For the large model (Scale 2), the first three acceleration ratios are 3.2x, 1.3x and 1.02x, demonstrating the same vanishing trend. A further increase in batch size even leads to a performance degradation instead. The same situation occurs for the larger model (Scale 3). Though the ability to process 192 images/second for training and 285 images/second for testing with our commodity GPU for the Scale 2 network is promising, this result still indicates that there is some scaling limitation within the vectorization for batch processing. Similar results in Table 4 seem to further suggest that this limitation is shared between training and testing. In order to completely understand the rationale under the hood, we have to resort to a detailed performance breakdown.

Performance Breakdown and Limitation. We decompose the whole training procedure into the following components: 1) conv layers; 2) pooling layers; 3) fully connected layers; 4) others (e.g. ReLU, cost). We distinguish the statistics between the forward pass and back-propagation, giving 8 components to look at.

Figure 6 illustrates the performance breakdown (in terms of the proportion of computing time spent in processing one batch) during training for representative cases from our largest network (Scale 3). The batch size is 1 for Fig. 6(a), 200 for Fig. 6(b) and 300 for Fig. 6(c).

[Figure 6: Performance breakdown. (a) Scale 3 network, batch size = 1. (b) Scale 3 network, batch size = 200. (c) Scale 3 network, batch size = 300. conv: conv layers, pool: pooling layers, full: fully connected layers, other: other operations, _f: forward pass, _b: back-propagation.]

We can observe from Fig. 6(a) that 44% of the overall time was used in processing the fully connected layers. This was in fact the biggest consumer of computing time for this batch size. We also see that the time spent on full_b is significantly more than on full_f. This makes sense because it involves larger matrix multiplications and a larger transform matrix than the forward pass. The second largest consumer of time is the convolution layers, and we can see that the time spent in the forward pass and back-propagation is reasonably balanced.

However, the situation we found in Fig. 6(b) and Fig. 6(c) is very different. One obvious characteristic is that, when increasing the batch size, the time cost of the conv layers is now considerably more than that of the fully connected layers. While the proportion between full_f and full_b roughly remains the same among the three batch sizes, we found that conv_f spent much more time than conv_b for large batch sizes. This indicates that the scaling limitation lies within conv_f when vectorizing for batch processing. Further scrutiny of this issue shows that the limitation is caused by two factors, namely the memory overhead in handling multiple samples and the overhead caused by invoking patchification on bigger samples. While there might be alternative strategies to vectorize batch processing, we argue that the aforementioned overhead is hard to avoid completely.
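For reference, a per-component breakdown of this kind can be collected with simple wall-clock timing. The sketch below is illustrative only: conv_f, conv_b, pool_f, pool_b, full_f, full_b, other_f and other_b are hypothetical stage routines (not functions of VCNN or of Matlab), each assumed to push one batch through the corresponding part of the network; wait(gpuDevice) from the Parallel Computing Toolbox synchronizes pending GPU work before the clock is stopped.

    % Illustrative timing sketch for a breakdown as in Fig. 6; the stage
    % functions are hypothetical placeholders.
    stages = {@conv_f, @conv_b, @pool_f, @pool_b, @full_f, @full_b, @other_f, @other_b};
    t = zeros(1, numel(stages));
    for k = 1:numel(stages)
        tic;
        stages{k}(batch);        % run one batch through this stage (placeholder call)
        wait(gpuDevice);         % flush queued GPU work before reading the timer
        t(k) = toc;
    end
    share = 100 * t / sum(t);    % proportion of computing time per component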
fromFig.6are alsovalidforscale1 andscale2 networks, Similarresultsintable4seemstofurthersuggestthatsuch butwithanimportantdifference.Forsmallnetworkslikethe limitationissharedbetweentrainingandtesting.Inorderto scale1network,theaccelerationbroughtbybatchprocess- completelyunderstandtherationaleunderthehood,wehave ingshallbevalidforverybigbatchsizes(e.g.1000)whilst toresorttoadetailedperformancebreakdown. for large networks batch size needs to be chosen carefully Performance Breakdown and Limitation. We decom- or else the speed degradationlike we saw in table 3 and 4 pose the whole training procedureinto the following com- shall occur before the network hits the GPU memory ceil- ponents.Theyare1)convlayers;2)poolinglayers;3)fully ing.Thissuggeststhatgivena networkdesignchoosingan connectedlayers;4)others(e.g.ReLU,cost).Wedistinguish appropriate batch size may be vital in achieving the opti- CNNforImageProcessing Imageprocessingtasksdonotrequirepoolingandfullycon- nected layers in general. To verify the effectiveness of the proposedvectorizedframework,weimplementedanetwork architecturebysimplyremovingthepoolingandfullycon- nectedlayersfromFig.1andtrainedthenetworkwithsyn- thesized clear-noisy image pairs. One of the denoise result (a) (b) is given in Fig. 8. Another sample application of our vec- torized CNN is the recent proposed image deconvolution Figure7:Speedof10randomlyselectednetworks.X axis, (Xuetal.2014).ResultisshowninFig.9. batch size. Y axis, number of images to be processed per second.(a)fortraining.(b)fortesting. malspeed.Basedonourscale2network,weselect10other networksbyrandomlyadjustingseveralparameterssuchas filter size, numberof feature maps, numberof outputunits and sigmoid function, etc. We run these networks for both trainingandtestingbyadjustingthebatchsizestoseeifthis Figure8:Applicationinimagedenoising. contentionisgenerallyapplicableforlargenetworks. Figure7confirmsouraforementionedcontentionforlarge networksand makes the importanceof choosing an appro- priatebatchsize obvious.First, it suggeststhatthe optimal batch size among different network parameters is usually quitedifferent.Directlyadoptingabatchsizefromaprevi- oussetofnetworkparametersmaylead to significantlyin- feriorspeed.Second,italsosuggeststhattheoptimalbatch size betweenthetrainingstage andthe testingstageis also different, even if for the same network. A naive adoption of the batch size from the training stage is often not opti- malandleadstoconsiderablespeedloss.Thesefindingshas direct implications in building real-time systems in which optimizationformodeltestingisthekey. Figure9:Applicationinimagedeconvolution. Unification ofHigh/LowLevel VisionTasks Despite the rapid adoption of deep CNN in addressing NovelTrainingSchemeforMulti-objectDetection various kinds of high level computer vision tasks typified Conventionalimageclassifiersareusuallytrainedbyimage by image classification and object localization,other prob- sampleswith equalsizes. Thisimposesacriticallimitation lems such as detecting objects of different shapes in real- whenapplyingitindetection.Forinstance,itisreasonable timeseem stilla problemunderinvestigation.On theother toputahumanfacesampleinasquareimage,butdoingso hand, we observed that there are a few very recent stud- for non-squared objects (e.g. 
Novel Training Scheme for Multi-object Detection

Conventional image classifiers are usually trained with image samples of equal size. This imposes a critical limitation when applying them to detection. For instance, it is reasonable to put a human face sample in a square image, but doing so for non-square objects (e.g. shoes) tends to include more background content and thus introduces more noise, which is detrimental to accurate and efficient object detection. One possible alternative is to formulate object detection as a regression problem (Szegedy, Toshev, and Erhan 2013); however, this requires a very large amount of data and usually very big models to capture the variety of possible patterns.

Using VCNN, we were able to train a single image classifier with heterogeneous input sizes by using vectorization. The key insight is that heterogeneous inputs can actually share all the weights in a CNN except the ones in the connection between the conv layer and the fully connected layer. This approach not only avoids the background noise but is also a lot more lightweight than the regression approach. We successfully applied it in a detection system which runs in real time. We can show that this approach tends to produce fewer false alarms and works efficiently with multi-scale detection through vectorization.

[Figure 10: Application in real-time multi-object detection. Shoes review videos are from Youtube.]

Conclusion

In this paper, we elaborate on several aspects of the vectorization of deep CNNs. First, we present the vectorization steps for all essential parts of implementing deep CNNs. The vectorization steps are further exemplified by Matlab practices. Second, we have developed and compared six CNN implementations with different degrees of vectorization to analyze the impact of vectorization on speed. Third, based on these practices, we provide a unified framework for handling both low-level and high-level vision tasks. Experiments on various applications including image denoising, deconvolution and real-time object detection demonstrate the effectiveness of the proposed strategies.
As the introduced vectorization techniques are general enough, our future directions include optimization for different hardware and cloud platforms.

References

Bengio, Y., and LeCun, Y. 2007. Scaling learning algorithms towards AI. Large-Scale Kernel Machines 34:1–41.

Bergstra, J.; Breuleux, O.; Bastien, F.; Lamblin, P.; Pascanu, R.; Desjardins, G.; Turian, J.; Warde-Farley, D.; and Bengio, Y. 2010. Theano: a CPU and GPU math compiler in Python. In Python in Science Conference, 1–7.

Chellapilla, K.; Puri, S.; Simard, P.; et al. 2006. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition.

Coates, A.; Huval, B.; Wang, T.; Wu, D. J.; Catanzaro, B. C.; and Ng, A. Y. 2013. Deep learning with COTS HPC systems. In ICML, 1337–1345.

Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Le, Q. V.; Mao, M. Z.; Ranzato, M.; Senior, A. W.; Tucker, P. A.; Yang, K.; and Ng, A. Y. 2012. Large scale distributed deep networks. In NIPS, 1232–1240.

Denton, E.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736.

Eigen, D.; Krishnan, D.; and Fergus, R. 2013. Restoring an image taken through a window covered with dirt or rain. In NIPS.

Farabet, C.; Poulet, C.; Han, J. Y.; and LeCun, Y. 2009. CNP: An FPGA-based processor for convolutional networks. In International Conference on Field Programmable Logic and Applications, 32–37.

Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Hinton, G.; Osindero, S.; and Teh, Y.-W. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527–1554.

Jackel, L.; Boser, B.; Denker, J.; Graf, H.; LeCun, Y.; Guyon, I.; Henderson, D.; Howard, R.; Hubbard, W.; and Solla, S. 1990. Hardware requirements for neural-net optical character recognition. In International Joint Conference on Neural Networks, 855–861.

Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1106–1114.

LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4):541–551.

LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1990. Handwritten digit recognition with a back-propagation network. In NIPS.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

Mathieu, M.; Henaff, M.; and LeCun, Y. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851.

Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML, 807–814.

Ragan-Kelley, J.; Barnes, C.; Adams, A.; Paris, S.; Durand, F.; and Amarasinghe, S. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48(6):519–530.

Russakovsky, O.; Deng, J.; Huang, Z.; Berg, A. C.; and Fei-Fei, L. 2013. Detecting avocados to zucchinis: What have we done, and where are we going? In ICCV, 2064–2071.

Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; and LeCun, Y. 2013. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.

Simard, P. Y.; Steinkraus, D.; and Platt, J. C. 2003. Best practices for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recognition, volume 2, 958–958.

Szegedy, C.; Toshev, A.; and Erhan, D. 2013. Deep neural networks for object detection. In NIPS, 2553–2561.

Wan, L.; Zeiler, M. D.; Zhang, S.; LeCun, Y.; and Fergus, R. 2013. Regularization of neural networks using DropConnect. In ICML, 1058–1066.
Xu, L.; Ren, J.; Liu, C.; and Jia, J. 2014. Deep convolutional neural network for image deconvolution. In NIPS.
