On Vectorization of Deep Convolutional Neural Networks for Vision Tasks

Jimmy SJ. Ren    Li Xu
Lenovo Research & Technology
http://vcnn.deeplearning.cc
jimmy.sj.ren@gmail.com    xulihk@lenovo.com
Abstract

We recently have witnessed many ground-breaking results in machine learning and computer vision, generated by using deep convolutional neural networks (CNN). While the success mainly stems from the large volume of training data and the deep network architectures, the vector processing hardware (e.g. GPU) undisputedly plays a vital role in modern CNN implementations to support massive computation. Though much attention was paid in the extant literature to understanding the algorithmic side of deep CNN, little research was dedicated to the vectorization for scaling up CNNs. In this paper, we study the vectorization process of key building blocks in deep CNNs, in order to better understand and facilitate parallel implementation. Key steps in training and testing deep CNNs are abstracted as matrix and vector operators, upon which parallelism can be easily achieved. We developed and compared six implementations with various degrees of vectorization, with which we illustrate the impact of vectorization on the speed of model training and testing. Besides, a unified CNN framework for both high-level and low-level vision tasks is provided, along with a vectorized Matlab implementation with state-of-the-art speed performance.

Introduction

Deep convolutional neural network (CNN) has become a keen tool in addressing large scale artificial intelligence tasks. Though the study of CNN can be traced back to the late 1980s (LeCun et al. 1989; LeCun et al. 1990), the recent success of deep CNN is largely attributed to the concurrent progress of two technical streams. On the one hand, new deep CNN architectures with elements such as Dropout (Hinton et al. 2012; Krizhevsky, Sutskever, and Hinton 2012), DropConnect (Wan et al. 2013), Rectified Linear Units (ReLU) (Nair and Hinton 2010) as well as new optimization strategies (Dean et al. 2012) have empowered deep CNN with greater learning capacity. On the other hand, the rapid advances and democratization of high performance general purpose vector processing hardware, typified by the graphics processing unit (GPU), unleash the potential power of deep CNN by scaling up the network significantly.

Various infrastructures were used in scaling up deep CNNs, including GPU (Coates et al. 2013), distributed CPU based frameworks (Dean et al. 2012), FPGA (Farabet et al. 2009), etc. Though the implementation details among those approaches differ, the core insight underlying the idea of scaling up deep CNN is parallelization (Bengio and LeCun 2007), in which vectorization is the fundamental element. While the consecutive distinguished performance of GPU trained CNNs in the ImageNet visual recognition challenge (Krizhevsky, Sutskever, and Hinton 2012; Russakovsky et al. 2013), as well as the reported results in many studies in the literature, justify its effectiveness (Jia et al. 2014; Sermanet et al. 2013), the published literature did not provide sufficient insights on how the vectorization was carried out in detail. We also found no previous study answering how different degrees of vectorization influence the performance of deep CNN, which is, however, crucial in finding the bottlenecks and helps to scale up the network architecture. We believe these questions form a significant research gap, and the answers to them shall shed some light on the design, tuning and implementation of vectorized CNNs.

In this paper, we reinterpret the key operators in deep CNNs in vectorized forms with which high parallelism can be easily achieved given basic parallelized matrix-vector operators. To show the impact of vectorization on the speed of both model training and testing, we developed and compared six implementations of CNNs with various degrees of vectorization. We also provide a unified framework for both high-level and low-level vision applications including recognition, detection, denoising and image deconvolution. Our Matlab Vectorized CNN implementation (VCNN) will be made publicly available on the project webpage.

Copyright (c) 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Work

Efforts on speeding up CNN by vectorization start with its inception. A specialized CNN chip (Jackel et al. 1990) was built and successfully applied to handwriting recognition in the early 90s. Simard et al. (2003) simplified CNN by fusing convolution and pooling operations. This speeded up the network and performed well in document analysis.
Chellapilla et al. (2006) adopted the same architecture but unrolled the convolution operation into a matrix-matrix product. It has now been proven that this vectorization approach works particularly well with modern GPUs. However, limited by the available computing power, the scale of the CNN explored at that time was much smaller than modern deep CNNs.

When deep architecture showed its ability to effectively learn highly complex functions (Hinton, Osindero, and Teh 2006), scaling up neural network based models soon became one of the major tasks in deep learning (Bengio and LeCun 2007). Vectorization played an important role in achieving this goal. Scaling up CNN by vectorized GPU implementations such as Caffe (Jia et al. 2014), Overfeat (Sermanet et al. 2013), CudaConvnet (Krizhevsky, Sutskever, and Hinton 2012) and Theano (Bergstra et al. 2010) generates state-of-the-art results on many vision tasks. Albeit the good performance, few of the previous papers elaborated on their vectorization strategies. As a consequence, how vectorization affects design choices in both model training and testing is unclear.

Efforts were also put into the acceleration of parts of the deep CNN from algorithmic aspects, exemplified by the separable kernels for convolution (Denton et al. 2014) and the FFT speedup (Mathieu, Henaff, and LeCun 2013). Instead of finding a faster alternative for one specific layer, we focus more on the general vectorization techniques used in all building blocks in deep CNNs, which is instrumental not only in accelerating existing networks, but also in providing guidance for implementing and designing new CNNs across different platforms, for various vision tasks.

Vectorization of Deep CNN

Vectorization refers to the process that transforms the original data structure into a vector representation so that the scalar operators can be converted into a vector implementation. In this section, we introduce vectorization strategies for different layers in deep CNNs.

Figure 1: Convolutional Neural Network architecture for visual recognition (convolution, pooling, convolution, pooling, fully connected).

Figure 1 shows the architecture of a typical deep CNN for vision tasks. It contains all of the essential parts of modern CNNs. Comprehensive introductions on CNN's general architecture and the recent advances can be found in (LeCun et al. 1998) and (Krizhevsky, Sutskever, and Hinton 2012).

We mark the places where vectorization plays an important role. "a" is the convolution layer that transforms the input image into feature representations, whereas "b" is the one to handle the pooling related operations. "c" represents the convolution related operations for feature maps. We will see shortly that the vectorization strategies between "a" and "c" are slightly different. "d" involves operations in the fully connected network. Finally, "e" is the vectorization operation required to simultaneously process multiple input samples (e.g. mini-batch training). It is worth noting that we need to consider both forward pass and back-propagation for all these operations.

Vectorizing Convolution

We refer to the image and intermediate feature maps as f and one of the convolution kernels as w_i; the convolution layer can be typically expressed as

    f_i^{l+1} = σ(w_i^l ∗ f^l + b_i^l),    (1)

where i indexes the ith kernel and l indexes the layer. b_i^l is the bias weight. ∗ is the convolution operator. For vision tasks, f can be 2- or 3-dimensional. The outputs from the previous layer can be deemed as one single input f^l. σ is the nonlinear function, which could be ReLU, hyperbolic tangent, sigmoid, etc. Adding the bias weight and applying the nonlinear mapping are element-wise operations which can be deemed as already fully vectorized, i.e. the whole feature vector can be processed simultaneously. Contrarily, the convolution operators involve a bunch of multiplications with conflicting memory access. Even if the operators are parallelized for each pixel, the parallelism (Ragan-Kelley et al. 2013) to be exploited is rather limited: compared to the number of computing units on a GPU, the number of convolutions in one layer is usually smaller. A fine-grained parallelism on element-wise multiplication is much preferred, leading the vectorization process to unroll the convolution.

In what follows, all the original data f, b and w can be viewed as data vectors. Specifically, we seek vectorization operators ϕ_c() to map a kernel or feature map to its matrix form so that convolution can be conducted by matrix-vector multiplication. However, a straightforward kernel-matrix, image-vector product representation of convolution is not applicable here, since the kernel matrix is a sparse block-Toeplitz-Toeplitz-block one, not suitable for parallelization due to the existence of many zero elements. Thanks to the duality of kernel and feature map in convolution, we can construct a dense feature-map-matrix and a kernel-vector. Further, multiple kernels can be put together to form a matrix so as to generate multiple feature map outputs simultaneously,

    [f_i^{l+1}]_i = σ(ϕ_c(f^l) [w_i^l]_i + [b_i^l]_i).    (2)

Operator [ ]_i is to assemble vectors with index i to form a matrix.

Backpropagation. The training procedure requires the backward propagation of gradients through ϕ_c(f^l). Note that ϕ_c(f^l) is in the unrolled matrix form, different from the outputs of the previous layer [f_i^l]_i. An inverse operator ϕ_c^{-1}() is thus required to transform the matrix-form gradients into the vector form for further propagation. Since ϕ_c() is a one-to-many mapping, ϕ_c^{-1}() is a many-to-one operator. Fortunately, the gradient update is a linear process which can also be processed separately and combined afterwards.
Figure 2: Strategy to vectorize convolution. Illustration of the way to convolve a 3x3 input image with three 2x2 kernels and generate three 2x2 feature maps.

Figure 3: Strategy to vectorize convolution with a feature map. Illustration of the way to convolve a 3x3x3 feature map.

Figure 4: Strategy to vectorize pooling. Illustration of pooling for one 4x4 feature map.

Matlab Practice. Our Matlab implementation to vectorize the input image is shown in Fig. 2. Specifically, we first crop the image patches based on the kernel size and reorganize them into columns, as indicated by the dotted bounds. Convolution kernels are arranged by rows in another matrix. We can see that the product of these two matrices will put all the convolved feature maps in the resulting matrix, one feature map per row. ϕ_c() here can be efficiently implemented by the im2col() function in Matlab on both GPU¹ and CPU. We note that the feature map vectors here are transposes of f_i, simply because of the behavior of im2col().

¹ We used a custom version of im2col() for GPU.
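The following minimal Matlab sketch mirrors the toy setting of Fig. 2 (a 3x3 input convolved with three 2x2 kernels); the variable names (img, kernels, featMaps) are illustrative and not part of VCNN's interface.

% Unrolled convolution of Eq. (2) on the toy case of Fig. 2.
img     = reshape(1:9, 3, 3);               % 3x3 input image
k       = 2;                                % kernel size
kernels = rand(3, k*k);                     % three 2x2 kernels, one per row
bias    = rand(3, 1);                       % one bias per kernel

patches  = im2col(img, [k k], 'sliding');   % phi_c(f): one image patch per column
featMaps = max(bsxfun(@plus, kernels * patches, bias), 0);  % matrix product + bias + ReLU
map1     = reshape(featMaps(1, :), [2 2]);  % first 2x2 feature map

The single call to im2col() followed by one matrix-matrix product is what exposes the fine-grained parallelism discussed above; the same pattern scales to larger inputs and more kernels.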
Convolution with the feature map. An alternative practice is needed to handle convolution of the feature map (e.g. "c" in Fig. 1). This is because we need to first combine [f_i^l]_i into a higher dimensional f^l and then perform convolution. One example is illustrated in Fig. 3, where we need to combine three 3x3 feature maps into a 3x3x3 one and apply ϕ_c(). In practice, we could first apply ϕ_c() three times to f_i^l and then combine the results. We found this less efficient since the actual number of feature maps is much larger. To exploit more parallelism, we try to reorganize the data and apply the vectorization operator ϕ_c() only once.

In Fig. 3, we first reshape the column vectors [f_i]_i (a) back to a 2D matrix and put them side by side (b). This operation is cheap since it just changes the representation of the data but does not rearrange it. It allows us to vectorize all the feature maps in a holistic way (c). Since only the valid region is considered during convolution, we set all the redundant columns to zeros. The final ϕ_c(f) is obtained by rearranging the intermediate result in (c). Since redundant columns are involved, we use the Matlab function accumarray() to handle the many-to-one rearrangement.

One may note that the operator ϕ_c^{-1}() is a many-to-one mapping as well. So it can be efficiently implemented by accumarray(), for backpropagation.
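As a sketch of this many-to-one direction, the snippet below scatter-adds patch-matrix gradients back to the image layout with accumarray(); it reuses img, k, kernels and featMaps from the previous sketch, and dFeatMaps is a stand-in for the gradient arriving from the layer above, not a VCNN identifier.

% phi_c^{-1}(): accumulate unrolled-matrix gradients back to image form.
dFeatMaps = ones(size(featMaps));          % stand-in upstream gradient
dPre      = dFeatMaps .* (featMaps > 0);   % back through the ReLU
dPatches  = kernels' * dPre;               % gradient w.r.t. phi_c(f), same shape as patches

pixIdx = im2col(reshape(1:numel(img), size(img)), [k k], 'sliding');  % pixel index per patch entry
dImg   = accumarray(pixIdx(:), dPatches(:), [numel(img) 1]);          % sum overlapping contributions
dImg   = reshape(dImg, size(img));         % gradient w.r.t. the input f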
Vectorizing Pooling

It is inefficient to carry out the pooling separately for each feature map, so the goal here is to simultaneously process those separate operations by vectorization. The pooling operator can be abstracted as

    f^{l+1} = σ(ϕ_p(f^l) + b^l),    (3)

where ϕ_p() is a many-to-one mapping with a defined operation corresponding to max- or average-pooling.

Due to the information loss, the inverse operator ϕ_p^{-1}() is not well defined. We simply use nearest neighbor upscaling for approximation during backpropagation.

The pooling operations, both average pooling and max pooling, can be thought of as a vector accumulation process guided by a pre-defined index map, which can be similarly implemented by accumarray(). The only difference for max pooling is that a max function is involved. For overlapping pooling, we could insert the overlapped elements into the feature map and then apply the same pooling strategy.
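A minimal sketch of this index-guided accumulation, in the setting of Fig. 4 (non-overlapping 2x2 pooling of one 4x4 map); fmap and poolIdx are illustrative names only.

% Pooling as accumulation over a pre-defined index map (Fig. 4).
fmap    = reshape(1:16, 4, 4);                 % toy 4x4 feature map
poolIdx = kron(reshape(1:4, 2, 2), ones(2));   % index map: which 2x2 region each pixel joins

avgPool = reshape(accumarray(poolIdx(:), fmap(:), [4 1], @mean), [2 2]);  % average pooling
maxPool = reshape(accumarray(poolIdx(:), fmap(:), [4 1], @max),  [2 2]);  % max pooling

% Approximate backward pass by nearest-neighbor upscaling, as described above:
dPool = ones(2);                               % stand-in upstream gradient
dFmap = kron(dPool, ones(2));                  % replicate each value over its 2x2 region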
Vectorizing Fully Connected Layers

The fully connected layers (i.e. "d" in Fig. 1) can be written as a dense matrix-matrix multiplication. Thus both the feed forward pass and the backpropagation are naturally vectorized, in a unified matrix-matrix form.
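A minimal sketch of this unified matrix-matrix form over a mini-batch; W, bfc, X and dY are illustrative names, with samples stored as columns.

% Fully connected layer as dense matrix-matrix products.
X   = rand(512, 100);                        % 512 input features, batch of 100 columns
W   = randn(256, 512) * 0.01;                % 256 output units
bfc = zeros(256, 1);

Y   = max(bsxfun(@plus, W * X, bfc), 0);     % feed forward: one matrix product + ReLU

dY  = randn(size(Y)) .* (Y > 0);             % stand-in gradient, passed back through ReLU
dW  = dY * X';                               % weight gradient, also a matrix-matrix product
dX  = W' * dY;                               % gradient propagated to the previous layer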
Vectorization for Mini-batches

Our vectorization strategy can be directly extended to support mini-batch training. Given a batch of samples indexed by j, the mini-batch training with a convolution layer is given by

    [f_i^{l+1}]_i = σ([ϕ_c(f_j^l)]_j [w_i^l]_i + [b_i^l]_i),    (4)

where [ ]_j is to assemble the matrices of different samples.

Figure 5 shows the Matlab implementation of batch mode for the same operation as in Fig. 2 with a batch size of 2. Both samples in the input batch are vectorized and the outputs are arranged horizontally. We can show that the product of the same kernel matrix as in Fig. 2 and this matrix is able to simultaneously generate feature maps for both samples. Note that if an input sample of a convolutional layer has multiple channels, we could treat it as a multi-channel feature map as shown in Fig. 3.

Figure 5: Strategy to vectorize mini-batch operations.
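A minimal sketch of Eq. (4) in the setting of Fig. 5: the patch matrices of the two samples are placed side by side so that a single matrix product yields the feature maps of the whole batch; all names are illustrative.

% Mini-batch convolution: concatenate per-sample patch matrices horizontally.
batch   = rand(3, 3, 2);                               % two 3x3 samples
k       = 2;
kernels = rand(3, k*k);                                % three 2x2 kernels, one per row

nPos = (size(batch,1)-k+1) * (size(batch,2)-k+1);      % patches per sample (4 here)
P    = zeros(k*k, nPos * size(batch,3));
for j = 1:size(batch, 3)                               % assemble [phi_c(f_j)]_j
    P(:, (j-1)*nPos+1 : j*nPos) = im2col(batch(:,:,j), [k k], 'sliding');
end
featMaps = max(kernels * P, 0);                        % 3 x 8: maps for sample 1, then sample 2
maps1    = featMaps(:, 1:nPos);                        % feature maps of the first sample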
Experiments and Analysis

The goal of the experiments presented in this section is to understand the role of vectorization in training and testing CNNs as well as its limitations.

In order to make our experiment results of high validity and relevancy, we compared the training and testing speed of our fully vectorized implementation with Caffe and Cuda-convnet; the speed is competitive, if not faster, in all the tested cases.

Comparing Different Degrees of Vectorization

In this section, we seek to understand the role of vectorization by comparing six CNN implementations. These implementations differ in the degree of vectorization, which is illustrated in table 1. Imp-1 is the least vectorized one, Imp-2 naively parallelizes the processing of batch samples by adding a parallel for-loop to Imp-1, whilst Imp-6 is a fully vectorized implementation guided by the approaches we introduced. When working with one particular implementation we also observe how results change with network scales. The reason is that we would like to examine different vectorization strategies with both small scale networks and large scale ones, and we are particularly interested in large scale CNNs since they are more relevant to recent advances in the field.

Vec. ele.   Fu-co   Conv   Pool   Feat   Batch
Imp-6       X       X      X      X      X
Imp-5       X       X      X      X
Imp-4       X       X      X
Imp-3       X       X
Imp-2       X                            X
Imp-1       X

Table 1: Different vectorized elements included in the six CNNs. Fu-co: fully connected layer, Conv: convolutional layer, Pool: pooling layer, Feat: feature map patchification, Batch: vectorize for batch. A tick (X) indicates the element is vectorized.

We consider three scales. Scale 1 is a small model; it is very similar to the standard LeNet (LeCun et al. 1998), but with more feature maps in the convolutional layers², ReLU nonlinearity and cross-entropy error. Scale 2 is a large network with about 60 million trainable parameters, which is comparable to AlexNet (Krizhevsky, Sutskever, and Hinton 2012). However, the architecture is tailored for the purpose of this study. First, we would like to keep the number of conv layers and the number of fully connected layers balanced so that we shall have a fair performance breakdown. It also allows us to directly compare the results with Scale 1. Second, to enable a fair comparison, we would like to have a unified vectorization scheme for convolution throughout the network. Thus, unlike AlexNet which uses a stride 4 convolution in the first conv layer and stride 1 thereafter, all convolution operations use the same stride of 1 in our model. The consequences are that, compared to AlexNet, scale 2 tends to have more feature maps in the conv layers but a smaller size for the input images. We set the number of output units to 1000 as in AlexNet. Scale 3 is a larger network with 10,000 output units. This pushes the number of trainable parameters to 94 million, keeping other settings the same as scale 2. The performance on GPU³ (in terms of the number of images to be processed per second) of the six CNNs with different network scales during training is illustrated in table 2.

² This setting of the conv layers is the same as in Caffe's MNIST example. We have 2 fully connected hidden layers.
³ GeForce GTX 780 Ti; the same card was used in the rest of the experiments.

We are able to observe several insights from the figures. First of all, the results indicate that vectorization is vital to CNN's speed. We can see that Imp-1 is very slow and a naive parallelization (Imp-2) seems to work poorly on GPU. Especially for the large scale networks, Imp-1 and Imp-2 are simply too slow to be practical. When training a small network, a fully vectorized CNN (Imp-6) is more than 200 times faster than the naive parallelization version during training and more than 100 times faster during testing. This acceleration is going to be more significant for bigger networks since Imp-1 and Imp-2 scale poorly.

Second, all vectorization elements we introduced contribute significantly to the final performance, during both training and testing. One interesting insight is that the contribution of vectorizing pooling and feature map patchification seems to increase with the scale of the network. For instance, in table 2, Imp-4 (vectorize for pooling) has a 1.9x speedup over Imp-3 under scale 1 but a 4.5x speedup and a 4.3x speedup under scale 2 and scale 3 respectively. The same phenomenon happens for testing. This strongly indicates that the vectorization strategy for those two elements scales well with the size of the network.

On the other hand, we also observe that vectorizing batch processing brings more than 10x speedup for small models but only 3x to 5x speedup for large scale models. The contribution of vectorizing batch processing to the performance seems to decrease when scaling up the network, though the
speedup remains significant. We further investigate this phenomenon in the next section, which leads to a strategy to achieve optimal training and testing speed.

#img/sec   Imp-1   Imp-2   Imp-3   Imp-4   Imp-5   Imp-6
Scale 1    1       6.1     15.3    29.5    85.4    1312.1
Scale 2    n/a     n/a     2.4     11      42.3    188.7
Scale 3    n/a     n/a     2.3     10      34.3    161.2
* Batch size is 100 for scale 1 and 200 for scales 2 and 3.

Table 2: Training performance of the six CNNs (#images to be processed per second). Scale 1: small model, 10 output units; Scale 2: large model, 1000 output units; Scale 3: larger model, 10000 output units.

In Search of Optimal Speed
We investigated the puzzle of decelerating speedup by scrutinizing the performance against different batch sizes. The results are presented in tables 3 and 4 for training and testing respectively.

#img/sec   b=1    b=100   b=200   b=300   b=400
Scale 1    88.5   1312    1450.9  1574.2  1632.8
Scale 2    41.9   136.9   188.7   192.3   106.3
Scale 3    34.3   123.5   161.3   163.9   91

Table 3: Training performance of Imp-6 against different batch sizes (#images to be processed per second).

#img/sec   b=1    b=100   b=200   b=400   b=600
Scale 1    151.5  1812.6  1878.4  2023.5  2192.2
Scale 2    75.8   222.2   270.2   285.7   103.1
Scale 3    74     212.8   256.4   277.8   89.2

Table 4: Test performance of Imp-6 against different batch sizes (#images to be processed per second).

In table 3, we can see that for the small model (scale 1) the acceleration brought by each adjacent batch size increase is 14x, 1.1x, 1.08x and 1.03x. The acceleration obtained via the increase of batch size seems to be rapidly vanishing. For the large model (scale 2), the first three acceleration ratios are 3.2x, 1.3x and 1.02x, demonstrating the same vanishing trend. Further increase in batch size even leads to a performance degradation instead. The same situation occurs for the larger model (scale 3). Though the ability to process 192 images/second for training and 285 images/second for testing with our commodity GPU for the scale 2 network is promising, this result still indicates that there is some scaling limitation within the vectorization for batch processing. Similar results in table 4 seem to further suggest that such limitation is shared between training and testing. In order to completely understand the rationale under the hood, we have to resort to a detailed performance breakdown.

Performance Breakdown and Limitation. We decompose the whole training procedure into the following components: 1) conv layers; 2) pooling layers; 3) fully connected layers; 4) others (e.g. ReLU, cost). We distinguish the statistics between forward pass and back-propagation, therefore there are 8 components to look at.

Figure 6: Performance breakdown. (a) Scale 3 network, batch size = 1. (b) Scale 3 network, batch size = 200. (c) Scale 3 network, batch size = 300. conv: conv layers, pool: pooling layers, full: fully connected layers, other: other operations, _f: forward pass, _b: back-propagation.

Figure 6 illustrates the performance breakdown (in terms of the proportion of computing time in processing one batch) during training of three representative cases from our largest network (scale 3) in the experiment. The batch size is 1 for Fig. 6(a), 200 for Fig. 6(b) and 300 for Fig. 6(c). We can observe from Fig. 6(a) that 44% of the overall time was used in processing the fully connected layers. It was in fact the biggest consumer of the computing time for this batch size. We also see that the time spent on full_b is significantly more than full_f. This makes sense because it involves larger matrix multiplication and a larger transform matrix than that in the forward pass. The second largest consumer of time is the convolution layers, and we can see that the time spent in forward pass and back-propagation is reasonably balanced.

However, the situation we found in Fig. 6(b) and Fig. 6(c) is very different. One obvious character is, when increasing the batch size, the time cost by the conv layers is now considerably more than the fully connected layers. While the proportion between full_f and full_b among the three batch sizes roughly remains the same, we found conv_f spent much more time than conv_b for large batch sizes. This indicates the scaling limitation is within conv_f when vectorizing for batch processing. A further scrutiny of this issue shows that the limitation is caused by the following two factors, namely the memory overhead in handling multiple samples and the overhead caused by invoking patchification on bigger samples. While there might be alternative strategies to vectorize batch processing, we argue that the aforementioned overhead is hard to completely avoid.

Finding the Optimal Speed. We found the observations from Fig. 6 are also valid for scale 1 and scale 2 networks, but with an important difference. For small networks like the scale 1 network, the acceleration brought by batch processing shall be valid for very big batch sizes (e.g. 1000), whilst for large networks the batch size needs to be chosen carefully or else the speed degradation like we saw in tables 3 and 4 shall occur before the network hits the GPU memory ceiling. This suggests that, given a network design, choosing an
appropriate batch size may be vital in achieving the optimal speed. Based on our scale 2 network, we select 10 other networks by randomly adjusting several parameters such as filter size, number of feature maps, number of output units and sigmoid function, etc. We run these networks for both training and testing by adjusting the batch sizes to see if this contention is generally applicable for large networks.

Figure 7: Speed of 10 randomly selected networks. X axis: batch size. Y axis: number of images to be processed per second. (a) for training. (b) for testing.

Figure 7 confirms our aforementioned contention for large networks and makes the importance of choosing an appropriate batch size obvious. First, it suggests that the optimal batch size among different network parameters is usually quite different. Directly adopting a batch size from a previous set of network parameters may lead to significantly inferior speed. Second, it also suggests that the optimal batch size between the training stage and the testing stage is also different, even for the same network. A naive adoption of the batch size from the training stage is often not optimal and leads to considerable speed loss. These findings have direct implications in building real-time systems in which optimization for model testing is the key.
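In practice, such a batch size can be picked with a simple sweep like the sketch below; cnn_forward() and net are hypothetical stand-ins for a trained model and its testing-stage forward pass, and are not part of VCNN's published interface.

% Benchmark candidate batch sizes for the testing stage.
inH = 224; inW = 224; inC = 3;                 % assumed input size, for illustration only
batchSizes = [1 50 100 200 300 400 600];
imgPerSec  = zeros(size(batchSizes));
for k = 1:numel(batchSizes)
    b     = batchSizes(k);
    batch = rand(inH, inW, inC, b, 'single');  % dummy test batch
    t     = tic;
    for r = 1:10                               % repeat to smooth out timing noise
        cnn_forward(net, batch);               % hypothetical forward pass
    end
    imgPerSec(k) = 10 * b / toc(t);
end
[~, best] = max(imgPerSec);
fprintf('Best test batch size: %d (%.1f images/sec)\n', batchSizes(best), imgPerSec(best));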
Unification of High/Low Level Vision Tasks

Despite the rapid adoption of deep CNN in addressing various kinds of high level computer vision tasks, typified by image classification and object localization, other problems such as detecting objects of different shapes in real-time still seem to be under investigation. On the other hand, we observed that a few very recent studies (Xu et al. 2014; Eigen, Krishnan, and Fergus 2013) successfully used deep CNN in various low level vision tasks such as image deblurring and denoising, etc. Though the domain knowledge required to build those new networks substantially differs from that used in addressing high level vision tasks, the same vectorization principles presented in this paper apply.

More interestingly, the same vectorization principle across those tasks actually gives us a chance (perhaps for the first time) to unify both high level vision tasks and low level vision tasks in a single computational framework. In this section, we introduce the application of our VCNN implementation in tasks seemingly of distinct fields, namely image denoising and deblurring (low level vision) as well as multi-object detection (high level vision).

CNN for Image Processing

Image processing tasks do not require pooling and fully connected layers in general. To verify the effectiveness of the proposed vectorized framework, we implemented a network architecture by simply removing the pooling and fully connected layers from Fig. 1 and trained the network with synthesized clear-noisy image pairs. One of the denoising results is given in Fig. 8. Another sample application of our vectorized CNN is the recently proposed image deconvolution (Xu et al. 2014). The result is shown in Fig. 9.

Figure 8: Application in image denoising.

Figure 9: Application in image deconvolution.

Novel Training Scheme for Multi-object Detection

Conventional image classifiers are usually trained by image samples with equal sizes. This imposes a critical limitation when applying them in detection. For instance, it is reasonable to put a human face sample in a square image, but doing so for non-squared objects (e.g. shoes) tends to include more background content and thus introduces more noise, which is detrimental to accurate and efficient object detection. One possible alternative is to formulate object detection as a regression problem (Szegedy, Toshev, and Erhan 2013); however, it requires a very large amount of data and usually very big models to capture the variety of the possible patterns.

Figure 10: Application in real time multi-object detection. Shoes review videos are from Youtube.
Using VCNN, we were able to train a single image classifier but with heterogeneous input sizes by using vectorization. The key insight is that heterogeneous inputs can actually share all the weights in a CNN except the ones in the connection between the conv layer and the fully connected layer. This approach not only avoids the background noise but is also a lot more lightweight than the regression approach. We successfully applied it in a detection system which runs in real-time. We can show that this approach tends to have fewer false alarms and works efficiently with multi-scale detection through vectorization.

Conclusion

In this paper, we elaborate several aspects of the vectorization of deep CNN. First, we present the vectorization steps of all essential parts of implementing deep CNNs. The vectorization steps are further exemplified by Matlab practices. Second, we have developed and compared six CNN implementations with different degrees of vectorization to analyze the impact of vectorization on speed. Third, based on the practices, we provide a unified framework for handling both low-level and high-level vision tasks. Experiments on various applications including image denoising, deconvolution and real-time object detection demonstrated the effectiveness of the proposed strategies. As the introduced vectorization techniques are general enough, our future directions include optimization for different hardware or cloud platforms.
References

[Bengio and LeCun 2007] Bengio, Y., and LeCun, Y. 2007. Scaling learning algorithms towards AI. Large-scale kernel machines 34:1–41.
[Bergstra et al. 2010] Bergstra, J.; Breuleux, O.; Bastien, F.; Lamblin, P.; Pascanu, R.; Desjardins, G.; Turian, J.; Warde-Farley, D.; and Bengio, Y. 2010. Theano: a CPU and GPU math compiler in Python. In Python in Science Conf, 1–7.
[Chellapilla et al. 2006] Chellapilla, K.; Puri, S.; Simard, P.; et al. 2006. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition.
[Coates et al. 2013] Coates, A.; Huval, B.; Wang, T.; Wu, D. J.; Catanzaro, B. C.; and Ng, A. Y. 2013. Deep learning with COTS HPC systems. In ICML, 1337–1345.
[Dean et al. 2012] Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Le, Q. V.; Mao, M. Z.; Ranzato, M.; Senior, A. W.; Tucker, P. A.; Yang, K.; and Ng, A. Y. 2012. Large scale distributed deep networks. In NIPS, 1232–1240.
[Denton et al. 2014] Denton, E.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736.
[Eigen, Krishnan, and Fergus 2013] Eigen, D.; Krishnan, D.; and Fergus, R. 2013. Restoring an image taken through a window covered with dirt or rain. In NIPS.
[Farabet et al. 2009] Farabet, C.; Poulet, C.; Han, J. Y.; and LeCun, Y. 2009. CNP: An FPGA-based processor for convolutional networks. In International Conference on Field Programmable Logic and Applications, 32–37.
[Hinton et al. 2012] Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[Hinton, Osindero, and Teh 2006] Hinton, G.; Osindero, S.; and Teh, Y.-W. 2006. A fast learning algorithm for deep belief nets. Neural computation 18(7):1527–1554.
[Jackel et al. 1990] Jackel, L.; Boser, B.; Denker, J.; Graf, H.; LeCun, Y.; Guyon, I.; Henderson, D.; Howard, R.; Hubbard, W.; and Solla, S. 1990. Hardware requirements for neural-net optical character recognition. In International Joint Conference on Neural Networks, 855–861.
[Jia et al. 2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1106–1114.
[LeCun et al. 1989] LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1(4):541–551.
[LeCun et al. 1990] LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1990. Handwritten digit recognition with a back-propagation network. In NIPS.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
[Mathieu, Henaff, and LeCun 2013] Mathieu, M.; Henaff, M.; and LeCun, Y. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851.
[Nair and Hinton 2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML, 807–814.
[Ragan-Kelley et al. 2013] Ragan-Kelley, J.; Barnes, C.; Adams, A.; Paris, S.; Durand, F.; and Amarasinghe, S. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48(6):519–530.
[Russakovsky et al. 2013] Russakovsky, O.; Deng, J.; Huang, Z.; Berg, A. C.; and Fei-Fei, L. 2013. Detecting avocados to zucchinis: What have we done, and where are we going? In ICCV, 2064–2071.
[Sermanet et al. 2013] Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; and LeCun, Y. 2013. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
[Simard, Steinkraus, and Platt 2003] Simard, P. Y.; Steinkraus, D.; and Platt, J. C. 2003. Best practices for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recognition, volume 2, 958–958.
[Szegedy, Toshev, and Erhan 2013] Szegedy, C.; Toshev, A.; and Erhan, D. 2013. Deep neural networks for object detection. In NIPS, 2553–2561.
[Wan et al. 2013] Wan, L.; Zeiler, M. D.; Zhang, S.; LeCun, Y.; and Fergus, R. 2013. Regularization of neural networks using dropconnect. In ICML, 1058–1066.
[Xu et al. 2014] Xu, L.; Ren, J.; Liu, C.; and Jia, J. 2014. Deep convolutional neural network for image deconvolution. In NIPS.