Understanding the Effective Receptive Field in Deep Convolutional Neural Networks

Wenjie Luo*, Yujia Li*, Raquel Urtasun, Richard Zemel
Department of Computer Science, University of Toronto
{wenjie, yujiali, urtasun, zemel}@cs.toronto.edu

* denotes equal contribution
29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

We study characteristics of receptive fields of units in deep convolutional networks. The receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas in the image to capture information about large objects. We introduce the notion of an effective receptive field, and show that it both has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field. We analyze the effective receptive field in several architecture designs, and the effect of nonlinear activations, dropout, sub-sampling and skip connections on it. This leads to suggestions for ways to address its tendency to be too small.

1 Introduction

Deep convolutional neural networks (CNNs) have achieved great success in a wide range of problems in the last few years. In this paper we focus on their application to computer vision, where they are the driving force behind the significant improvement of the state of the art for many tasks recently, including image recognition [10, 8], object detection [17, 2], semantic segmentation [12, 1], image captioning [20], and many more.

One of the basic concepts in deep CNNs is the receptive field, or field of view, of a unit in a certain layer in the network. Unlike in fully connected networks, where the value of each unit depends on the entire input to the network, a unit in convolutional networks only depends on a region of the input. This region in the input is the receptive field for that unit.

The concept of receptive field is important for understanding and diagnosing how deep CNNs work. Since anywhere in an input image outside the receptive field of a unit does not affect the value of that unit, it is necessary to carefully control the receptive field to ensure that it covers the entire relevant image region. In many tasks, especially dense prediction tasks like semantic image segmentation, stereo and optical flow estimation, where we make a prediction for each single pixel in the input image, it is critical for each output pixel to have a big receptive field, such that no important information is left out when making the prediction.

The receptive field size of a unit can be increased in a number of ways. One option is to stack more layers to make the network deeper, which in theory increases the receptive field size linearly, as each extra layer increases the receptive field size by the kernel size. Sub-sampling, on the other hand, increases the receptive field size multiplicatively. Modern deep CNN architectures like the VGG networks [18] and Residual Networks [8, 6] use a combination of these techniques.
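To make the linear versus multiplicative growth concrete, here is a small sketch (my own illustration, not from the paper) that computes the theoretical receptive field of a stack of convolution layers with the standard recurrence r ← r + (k − 1)·j, j ← j·s, where j is the product of earlier strides. For stacks of 3×3, stride-1 layers it reproduces the theoretical RF sizes quoted in Fig. 1 of this paper (e.g. 21 for 10 layers, 41 for 20 layers).

```python
def theoretical_rf(layers):
    """Theoretical receptive field of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to
    output. Each layer adds (k - 1) times the current jump (the product of all
    previous strides) to the receptive field size.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Stacking 3x3, stride-1 layers grows the RF linearly:
print(theoretical_rf([(3, 1)] * 10))   # 21
print(theoretical_rf([(3, 1)] * 20))   # 41
# A single stride-2 layer doubles the contribution of all later layers,
# i.e. sub-sampling grows the RF multiplicatively:
print(theoretical_rf([(3, 1)] * 5 + [(3, 2)] + [(3, 1)] * 5))  # 33, vs. 23 without the stride
```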
In this paper, we carefully study the receptive field of deep CNNs, focusing on problems in which there are many output units. In particular, we discover that not all pixels in a receptive field contribute equally to an output unit's response. Intuitively, it is easy to see that pixels at the center of a receptive field have a much larger impact on an output. In the forward pass, central pixels can propagate information to the output through many different paths, while the pixels in the outer area of the receptive field have very few paths to propagate their impact. In the backward pass, gradients from an output unit are propagated across all the paths, and therefore the central pixels have a much larger magnitude for the gradient from that output.

This observation leads us to study further the distribution of impact within a receptive field on the output. Surprisingly, we can prove that in many cases the distribution of impact in a receptive field distributes as a Gaussian. Note that in earlier work [20] this Gaussian assumption about a receptive field is used without justification. This result further leads to some intriguing findings, in particular that the effective area in the receptive field, which we call the effective receptive field, only occupies a fraction of the theoretical receptive field, since Gaussian distributions generally decay quickly from the center.

The theory we develop for the effective receptive field also correlates well with some empirical observations. One such empirical observation is that the currently commonly used random initializations lead some deep CNNs to start with a small effective receptive field, which then grows during training. This potentially indicates a bad initialization bias.

Below we present the theory in Section 2 and some empirical observations in Section 3, which aim at understanding the effective receptive field for deep CNNs. We discuss a few potential ways to increase the effective receptive field size in Section 4.

2 Properties of Effective Receptive Fields

We want to mathematically characterize how much each input pixel in a receptive field can impact the output of a unit n layers up the network, and study how the impact distributes within the receptive field of that output unit. To simplify notation we consider only a single channel on each layer, but similar results can be easily derived for convolutional layers with more input and output channels.

Assume the pixels on each layer are indexed by (i, j), with their center at (0, 0). Denote the (i, j)-th pixel on the p-th layer as $x^p_{i,j}$, with $x^0_{i,j}$ as the input to the network, and $y_{i,j} = x^n_{i,j}$ as the output on the n-th layer. We want to measure how much each $x^0_{i,j}$ contributes to $y_{0,0}$. We define the effective receptive field (ERF) of this central output unit as the region containing any input pixel with a non-negligible impact on that unit.

The measure of impact we use in this paper is the partial derivative $\partial y_{0,0} / \partial x^0_{i,j}$. It measures how much $y_{0,0}$ changes as $x^0_{i,j}$ changes by a small amount; it is therefore a natural measure of the importance of $x^0_{i,j}$ with respect to $y_{0,0}$. However, this measure depends not only on the weights of the network, but is in most cases also input-dependent, so most of our results will be presented in terms of expectations over the input distribution.

The partial derivative $\partial y_{0,0} / \partial x^0_{i,j}$ can be computed with back-propagation. In the standard setting, back-propagation propagates the error gradient with respect to a certain loss function. Assuming we have an arbitrary loss $l$, by the chain rule we have

$$\frac{\partial l}{\partial x^0_{i,j}} = \sum_{i',j'} \frac{\partial l}{\partial y_{i',j'}} \frac{\partial y_{i',j'}}{\partial x^0_{i,j}}.$$

Then to get the quantity $\partial y_{0,0} / \partial x^0_{i,j}$, we can set the error gradient $\partial l / \partial y_{0,0} = 1$ and $\partial l / \partial y_{i,j} = 0$ for all $i \neq 0$ and $j \neq 0$, then propagate this gradient from there back down the network. The resulting $\partial l / \partial x^0_{i,j}$ equals the desired $\partial y_{0,0} / \partial x^0_{i,j}$. Here we use the back-propagation process without an explicit loss function, and the process can be easily implemented with standard neural network tools.
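This gradient computation is straightforward with automatic differentiation. Below is a minimal sketch, assuming PyTorch and a toy stack of 3×3 convolutions with PyTorch's default weight initialization (all of these are my own choices, not specified by the paper): run an input through the network, set the gradient to 1 at the central output unit and 0 everywhere else, back-propagate, and read the gradient image off the input.

```python
import torch
import torch.nn as nn

# A toy stack of conv layers: single channel, 3x3 kernels, stride 1, no bias.
n_layers, k = 10, 3
net = nn.Sequential(*[
    nn.Conv2d(1, 1, kernel_size=k, padding=k // 2, bias=False)
    for _ in range(n_layers)
])

x = torch.randn(1, 1, 64, 64, requires_grad=True)  # random input image
y = net(x)

# Gradient signal: 1 at the central output unit, 0 everywhere else.
grad = torch.zeros_like(y)
grad[0, 0, y.shape[2] // 2, y.shape[3] // 2] = 1.0
y.backward(grad)

erf = x.grad[0, 0].abs()   # gradient image g(., ., 0); abs() only for visualization
print(erf.shape, erf.max())
```

Averaging such gradient maps over many random inputs and random weight draws gives the kind of ERF plots shown in Section 3.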
In the following we first consider linear networks, where this derivative does not depend on the input and is purely a function of the network weights and (i, j), which clearly shows how the impact of the pixels in the receptive field distributes. Then we move forward to consider more modern architecture designs and discuss the effect of nonlinear activations, dropout, sub-sampling, dilated convolution and skip connections on the ERF.

2.1 The simplest case: a stack of convolutional layers of weights all equal to one

Consider the case of n convolutional layers using k × k kernels with stride one, a single channel on each layer and no nonlinearity, stacked into a deep linear CNN. In this analysis we ignore the biases on all layers. We begin by analyzing convolution kernels with weights all equal to one.

Denote $g(i, j, p) = \partial l / \partial x^p_{i,j}$ as the gradient on the p-th layer, and let $g(i, j, n) = \partial l / \partial y_{i,j}$. Then $g(\cdot, \cdot, 0)$ is the desired gradient image of the input. The back-propagation process effectively convolves $g(\cdot, \cdot, p)$ with the k × k kernel to get $g(\cdot, \cdot, p-1)$ for each p.

In this special case, the kernel is a k × k matrix of 1's, so the 2D convolution can be decomposed into the product of two 1D convolutions. We therefore focus exclusively on the 1D case. We have the initial gradient signal u(t) and kernel v(t) formally defined as

$$u(t) = \delta(t), \qquad v(t) = \sum_{m=0}^{k-1} \delta(t - m), \qquad \text{where } \delta(t) = \begin{cases} 1, & t = 0 \\ 0, & t \neq 0 \end{cases} \qquad (1)$$

and t = 0, 1, −1, 2, −2, ... indexes the pixels.

The gradient signal on the input pixels is simply $o = u * v * \cdots * v$, convolving u with n such v's. To compute this convolution, we can use the Discrete Time Fourier Transform to convert the signals into the Fourier domain, and obtain

$$U(\omega) = \sum_{t=-\infty}^{\infty} u(t) e^{-j\omega t} = 1, \qquad V(\omega) = \sum_{t=-\infty}^{\infty} v(t) e^{-j\omega t} = \sum_{m=0}^{k-1} e^{-j\omega m} \qquad (2)$$

Applying the convolution theorem, we have the Fourier transform of o as

$$\mathcal{F}(o) = \mathcal{F}(u * v * \cdots * v)(\omega) = U(\omega) \cdot V(\omega)^n = \left( \sum_{m=0}^{k-1} e^{-j\omega m} \right)^{n} \qquad (3)$$

Next, we need to apply the inverse Fourier transform to get back o(t):

$$o(t) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left( \sum_{m=0}^{k-1} e^{-j\omega m} \right)^{n} e^{j\omega t} \, d\omega \qquad (4)$$

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-j\omega s} e^{j\omega t} \, d\omega = \begin{cases} 1, & s = t \\ 0, & s \neq t \end{cases} \qquad (5)$$

We can see that o(t) is simply the coefficient of $e^{-j\omega t}$ in the expansion of $\left( \sum_{m=0}^{k-1} e^{-j\omega m} \right)^{n}$.

Case k = 2: Now let's consider the simplest nontrivial case of k = 2, where $\left( \sum_{m=0}^{k-1} e^{-j\omega m} \right)^{n} = (1 + e^{-j\omega})^{n}$. The coefficient for $e^{-j\omega t}$ is then the standard binomial coefficient $\binom{n}{t}$, so $o(t) = \binom{n}{t}$. It is quite well known that binomial coefficients distribute with respect to t like a Gaussian as n becomes large (see for example [13]), which means the scale of the coefficients decays as a squared exponential as t deviates from the center. When multiplying two 1D Gaussians together, we get a 2D Gaussian; therefore in this case, the gradient on the input plane is distributed like a 2D Gaussian.

Case k > 2: In this case the coefficients are known as "extended binomial coefficients" or "polynomial coefficients", and they too distribute like a Gaussian, see for example [3, 16]. This is included as a special case of the more general case presented later in Section 2.3.

2.2 Random weights

Now let's consider the case of random weights. In general, we have

$$g(i, j, p-1) = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} w^p_{a,b} \, g(i+a, j+b, p) \qquad (6)$$

with pixel indices properly shifted for clarity, and $w^p_{a,b}$ is the convolution weight at (a, b) in the convolution kernel on layer p. At each layer, the initial weights are independently drawn from a fixed distribution with zero mean and variance C. We assume that the gradients g are independent from the weights. This assumption is in general not true if the network contains nonlinearities, but for linear networks these assumptions hold. As $\mathbb{E}_w[w^p_{a,b}] = 0$, we can then compute the expectation

$$\mathbb{E}_{w,\mathrm{input}}[g(i, j, p-1)] = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} \mathbb{E}_w[w^p_{a,b}] \, \mathbb{E}_{\mathrm{input}}[g(i+a, j+b, p)] = 0, \quad \forall p \qquad (7)$$

Here the expectation is taken over the w distribution as well as the input data distribution. The variance is more interesting, as

$$\mathrm{Var}[g(i, j, p-1)] = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} \mathrm{Var}[w^p_{a,b}] \, \mathrm{Var}[g(i+a, j+b, p)] = C \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} \mathrm{Var}[g(i+a, j+b, p)] \qquad (8)$$

This is equivalent to convolving the gradient variance image $\mathrm{Var}[g(\cdot, \cdot, p)]$ with a k × k convolution kernel full of 1's, and then multiplying by C to get $\mathrm{Var}[g(\cdot, \cdot, p-1)]$.

Based on this we can apply exactly the same analysis as in Section 2.1 on the gradient variance images. The conclusions carry over easily: $\mathrm{Var}[g(\cdot, \cdot, 0)]$ has a Gaussian shape, with only a slight change of having an extra $C^n$ constant factor multiplier on the variance gradient images, which does not affect the relative distribution within a receptive field.
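The 1D argument above is easy to reproduce numerically. The following sketch (my own illustration with NumPy; the kernel size and depth are arbitrary choices) convolves a delta signal n times with an all-ones kernel, normalizes the result, and compares it against a Gaussian with the same mean and variance; the discrepancy is small, consistent with the Gaussian-shape claim.

```python
import numpy as np

k, n = 3, 20                       # kernel size and number of layers
v = np.ones(k)                     # all-ones 1D kernel
o = np.array([1.0])                # delta signal u(t)
for _ in range(n):                 # o = u * v * ... * v  (n convolutions)
    o = np.convolve(o, v)

o_norm = o / o.sum()               # normalize so it reads as a distribution
t = np.arange(len(o_norm))
mean = (t * o_norm).sum()
var = ((t - mean) ** 2 * o_norm).sum()

gauss = np.exp(-(t - mean) ** 2 / (2 * var))
gauss /= gauss.sum()
print(np.abs(o_norm - gauss).max())   # small: o(t) is close to a Gaussian profile
```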
2.3 Non-uniform kernels

More generally, each pixel in the kernel window can have different weights, or, as in the random weight case, they may have different variances. Let's again consider the 1D case, with u(t) = δ(t) as before, and the kernel signal $v(t) = \sum_{m=0}^{k-1} w(m) \, \delta(t - m)$, where w(m) is the weight for the m-th pixel in the kernel. Without loss of generality, we can assume the weights are normalized, i.e. $\sum_m w(m) = 1$.

Applying the Fourier transform and convolution theorem as before, we get

$$U(\omega) \cdot V(\omega) \cdots V(\omega) = \left( \sum_{m=0}^{k-1} w(m) \, e^{-j\omega m} \right)^{n} \qquad (9)$$

The space domain signal o(t) is again the coefficient of $e^{-j\omega t}$ in the expansion; the only difference is that the $e^{-j\omega m}$ terms are weighted by w(m).

These coefficients turn out to be well studied in the combinatorics literature, see for example [3] and the references therein for more details. In [3], it was shown that if the w(m) are normalized, then o(t) exactly equals the probability $p(S_n = t)$, where $S_n = \sum_{i=1}^{n} X_i$ and the $X_i$'s are i.i.d. multinomial variables distributed according to the w(m)'s, i.e. $p(X_i = m) = w(m)$. Notice the analysis there requires that w(m) > 0. But we can reduce to the variance analysis for the random weight case, where the variances are always nonnegative while the weights can be negative. The analysis for negative w(m) is more difficult and is left to future work. However, empirically we found the implications of the analysis in this section still apply reasonably well to networks with negative weights.

From the central limit theorem point of view, as $n \to \infty$, the distribution of $\sqrt{n}\left(\frac{1}{n} S_n - \mathbb{E}[X]\right)$ converges to a Gaussian $\mathcal{N}(0, \mathrm{Var}[X])$ in distribution. This means, for a given n large enough, $S_n$ is going to be roughly Gaussian with mean $n\mathbb{E}[X]$ and variance $n\mathrm{Var}[X]$. As $o(t) = p(S_n = t)$, this further implies that o(t) also has a Gaussian shape. When the w(m)'s are normalized, this Gaussian has the following mean and variance:

$$\mathbb{E}[S_n] = n \sum_{m=0}^{k-1} m \, w(m), \qquad \mathrm{Var}[S_n] = n \left[ \sum_{m=0}^{k-1} m^2 w(m) - \left( \sum_{m=0}^{k-1} m \, w(m) \right)^{2} \right] \qquad (10)$$

This indicates that o(t) decays from the center of the receptive field squared exponentially according to the Gaussian distribution. The rate of decay is related to the variance of this Gaussian. If we take one standard deviation as the effective receptive field (ERF) size, which is roughly the radius of the ERF, then this size is $\sqrt{\mathrm{Var}[S_n]} = \sqrt{n \, \mathrm{Var}[X_i]} = O(\sqrt{n})$.

On the other hand, as we stack more convolutional layers, the theoretical receptive field grows linearly; therefore, relative to the theoretical receptive field, the ERF actually shrinks at a rate of $O(1/\sqrt{n})$, which we found surprising.

In the simple case of uniform weighting, we can further see that the ERF size grows linearly with kernel size k. As w(m) = 1/k, we have

$$\sqrt{\mathrm{Var}[S_n]} = \sqrt{n \left[ \sum_{m=0}^{k-1} \frac{m^2}{k} - \left( \sum_{m=0}^{k-1} \frac{m}{k} \right)^{2} \right]} = \sqrt{\frac{n(k^2 - 1)}{12}} = O(k\sqrt{n}) \qquad (11)$$

Remarks: The result derived in this section, i.e., that the distribution of impact within a receptive field in deep CNNs converges to a Gaussian, holds under the following conditions: (1) All layers in the CNN use the same set of convolution weights. This is in general not true; however, when we apply the analysis of variance, the weight variances on all layers are usually the same up to a constant factor. (2) The convergence derived is convergence "in distribution", as implied by the central limit theorem. This means that the cumulative probability distribution function converges to that of a Gaussian, but at any single point in space the probability can deviate from the Gaussian. (3) The convergence result states that $\sqrt{n}\left(\frac{1}{n} S_n - \mathbb{E}[X]\right) \to \mathcal{N}(0, \mathrm{Var}[X])$, hence $S_n$ approaches $\mathcal{N}(n\mathbb{E}[X], n\mathrm{Var}[X])$; however, the convergence of $S_n$ here is not well defined, as $\mathcal{N}(n\mathbb{E}[X], n\mathrm{Var}[X])$ is not a fixed distribution but instead changes with n. Additionally, the distribution of $S_n$ can deviate from a Gaussian on a finite set. But the overall shape of the distribution is still roughly Gaussian.
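As a quick numerical reading of Eq. (11) (my own illustration, not from the paper), the snippet below compares the one-standard-deviation ERF radius $\sqrt{n(k^2-1)/12}$ with the radius of the theoretical receptive field of n stride-1 k × k layers, roughly $n(k-1)/2$; the absolute radius grows like $\sqrt{n}$ while the ratio shrinks like $1/\sqrt{n}$.

```python
import numpy as np

def erf_radius(k, n):
    """One-std ERF radius for uniform weights, Eq. (11): sqrt(n (k^2 - 1) / 12)."""
    return np.sqrt(n * (k ** 2 - 1) / 12.0)

def theoretical_radius(k, n):
    """Radius of the theoretical RF for n stride-1 k x k layers: n (k - 1) / 2."""
    return n * (k - 1) / 2.0

for n in (10, 100, 1000):
    erf, rf = erf_radius(3, n), theoretical_radius(3, n)
    print(n, erf, rf, erf / rf)   # ratio shrinks roughly like 1 / sqrt(n)
```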
2.4 Nonlinear activation functions

Nonlinear activation functions are an integral part of every neural network. We use σ to represent an arbitrary nonlinear activation function. During the forward pass, on each layer the pixels are first passed through σ and then convolved with the convolution kernel to compute the next layer. This ordering of operations is a little non-standard but equivalent to the more usual ordering of convolving first and passing through the nonlinearity, and it makes the analysis slightly easier. The backward pass in this case becomes

$$g(i, j, p-1) = \sigma'^{p}_{i,j} \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} w^p_{a,b} \, g(i+a, j+b, p) \qquad (12)$$

where we abuse notation a bit and use $\sigma'^{p}_{i,j}$ to represent the gradient of the activation function for pixel (i, j) on layer p.

For ReLU nonlinearities, $\sigma'^{p}_{i,j} = \mathbb{I}[x^p_{i,j} > 0]$, where $\mathbb{I}[\cdot]$ is the indicator function. We have to make some extra assumptions about the activations $x^p_{i,j}$ to advance the analysis, in addition to the assumption that they have zero mean and unit variance. A standard assumption is that $x^p_{i,j}$ has a symmetric distribution around 0 [7]. If we make an extra simplifying assumption that the gradients σ' are independent from the weights and g in the upper layers, we can simplify the variance as $\mathrm{Var}[g(i, j, p-1)] = \mathbb{E}[\sigma'^{p\,2}_{i,j}] \sum_a \sum_b \mathrm{Var}[w^p_{a,b}] \, \mathrm{Var}[g(i+a, j+b, p)]$, and $\mathbb{E}[\sigma'^{p\,2}_{i,j}] = \mathrm{Var}[\sigma'^{p}_{i,j}] = 1/4$ is a constant factor. Following the variance analysis we can again reduce this case to the uniform weight case.

Sigmoid and Tanh nonlinearities are harder to analyze. Here we only use the observation that when the network is initialized the weights are usually small and therefore these nonlinearities will be in the linear region, and the linear analysis applies. However, as the weights grow bigger during training their effect becomes hard to analyze.

2.5 Dropout, Subsampling, Dilated Convolution and Skip-Connections

Here we consider the effect of some standard CNN approaches on the effective receptive field. Dropout is a popular technique to prevent overfitting; we show that dropout does not change the Gaussian ERF shape. Subsampling and dilated convolutions turn out to be effective ways to increase receptive field size quickly. Skip-connections on the other hand make ERFs smaller. We present the analysis for all these cases in the Appendix.

3 Experiments

In this section, we empirically study the ERF for various deep CNN architectures. We first use artificially constructed CNN models to verify the theoretical results in our analysis. We then present our observations on how the ERF changes during the training of deep CNNs on real datasets. For all ERF studies, we place a gradient signal of 1 at the center of the output plane and 0 everywhere else, and then back-propagate this gradient through the network to get input gradients.

3.1 Verifying theoretical results

We first verify our theoretical results in artificially constructed deep CNNs. For computing the ERF we use random inputs, and for all the random weight networks we followed [7, 5] for proper random initialization. In this section, we verify the following results.

[Figure 1: Comparing the effect of number of layers, random weight initialization and nonlinear activation on the ERF. Kernel size is fixed at 3 × 3 for all the networks here. Panels: 5 layers (theoretical RF size 11), 10 layers (21), 20 layers (41), 40 layers (81), each shown for Uniform, Random, and Random+ReLU. Uniform: convolutional kernel weights are all ones, no nonlinearity; Random: random kernel weights, no nonlinearity; Random+ReLU: random kernel weights, ReLU nonlinearity.]

ERFs are Gaussian distributed: As shown in Fig. 1, we can observe perfect Gaussian shapes for uniformly and randomly weighted convolution kernels without nonlinear activations, and near-Gaussian shapes for randomly weighted kernels with nonlinearity. Adding the ReLU nonlinearity makes the distribution a bit less Gaussian, as the ERF distribution depends on the input as well. Another reason is that ReLU units output exactly zero for half of their inputs, and it is very easy to get a zero output for the center pixel on the output plane, which means no path from the receptive field can reach the output, hence the gradient is all zero. Here the ERFs are averaged over 20 runs with different random seeds. The figures on the right show the ERF for networks with 20 layers of random weights, with different nonlinearities (panels: ReLU, Tanh, Sigmoid). Here the results are averaged both across 100 runs with different random weights as well as different random inputs. In this setting the receptive fields are a lot more Gaussian-like.
√n absolute growth and 1/√n relative shrinkage: In Fig. 2, we show the change of ERF size and the relative ratio of ERF over theoretical RF with respect to the number of convolution layers. The best fitting line for ERF size gives a slope of 0.56 in the log domain, while the line for the ERF ratio gives a slope of −0.43. This indicates the ERF size is growing linearly with $\sqrt{N}$ and the ERF ratio is shrinking linearly with $1/\sqrt{N}$. Note that here we use 2 standard deviations as our measurement for ERF size, i.e. any pixel with value greater than 1 − 95.45% of the center point is considered as in the ERF. The ERF size is represented by the square root of the number of pixels within the ERF, while the theoretical RF size is the side length of the square in which all pixels have a non-zero impact on the output pixel, no matter how small. All experiments here are averaged over 20 runs.
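The ERF size measurement just described is simple to implement. Below is a minimal sketch (my own reading of the procedure, operating on an input-gradient map computed as in Section 2 and assuming the peak sits at the image center): keep the pixels whose gradient exceeds (1 − 0.9545) of the center value and take the square root of their count.

```python
import numpy as np

def erf_size(grad_img, coverage=0.9545):
    """ERF size of an input-gradient map, following the 2-std convention.

    A pixel is counted as inside the ERF if its absolute gradient exceeds
    (1 - coverage) times the center value; the ERF size is the square root
    of the number of such pixels.
    """
    g = np.abs(np.asarray(grad_img, dtype=float))
    h, w = g.shape
    center = g[h // 2, w // 2]          # assumes the ERF peak is at the center
    inside = g > (1.0 - coverage) * center
    return np.sqrt(inside.sum())
```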
Subsampling and dilated convolution increase the receptive field: The figure on the right shows the effect of subsampling and dilated convolution. The reference baseline is a convnet with 15 dense convolution layers, and its ERF is shown in the left-most panel. We then replace 3 of the 15 convolutional layers with stride-2 convolutions to get the ERF for the 'Subsample' panel, and replace them with dilated convolutions with factors 2, 4 and 8 for the 'Dilation' panel. As we see, both of them are able to increase the effective receptive field significantly. Note the 'Dilation' panel shows a rectangular ERF shape typical for dilated convolutions. [Panels: Conv-Only, Subsample, Dilation.]

3.2 How the ERF evolves during training

In this part, we take a look at how the ERF of units in the top-most convolutional layers of a classification CNN and a semantic segmentation CNN evolves during training. For both tasks, we adopt the ResNet architecture, which makes extensive use of skip-connections. As the analysis shows, the ERF of this network should be significantly smaller than the theoretical receptive field. This is indeed what we observed initially. Intriguingly, as the network learns, the ERF gets bigger, and at the end of training it is significantly larger than the initial ERF.

[Figure 2: Absolute growth (left) and relative shrink (right) for ERF.]

[Figure 3: Comparison of ERF before and after training for models trained on CIFAR-10 classification and CamVid semantic segmentation tasks. Panels: CIFAR10 and CamVid, each before training and after training. CIFAR-10 receptive fields are visualized in the image space of 32 × 32.]

For the classification task we trained a ResNet with 17 residual blocks on the CIFAR-10 dataset. At the end of training this network reached a test accuracy of 89%. Note that in this experiment we did not use pooling or downsampling, and exclusively focus on architectures with skip-connections. The accuracy of the network is not state-of-the-art but still quite high. In Fig. 3 we show the effective receptive field on the 32 × 32 image space at the beginning of training (with randomly initialized weights) and at the end of training when it reaches best validation accuracy. Note that the theoretical receptive field of our network is actually 74 × 74, bigger than the image size, but the ERF is still not able to fully fill the image. Comparing the results before and after training, we see that the effective receptive field has grown significantly.

For the semantic segmentation task we used the CamVid dataset for urban scene segmentation. We trained a "front-end" model [21], which is a purely convolutional network that predicts the output at a slightly lower resolution. This network plays the same role as the VGG network does in many previous works [12]. We trained a ResNet with 16 residual blocks interleaved with 4 subsampling operations, each with a factor of 2. Due to these subsampling operations the output is 1/16 of the input size. For this model, the theoretical receptive field of the top convolutional layer units is quite big at 505 × 505. However, as shown in Fig. 3, the ERF only gets a fraction of that, with a diameter of 100 at the beginning of training. Again we observe that during training the ERF size increases, and at the end it reaches a diameter of almost 150.

4 Reduce the Gaussian Damage

The above analysis shows that the ERF only takes a small portion of the theoretical receptive field, which is undesirable for tasks that require a large receptive field.

New Initialization. One simple way to increase the effective receptive field is to manipulate the initial weights. We propose a new random weight initialization scheme that makes the weights at the center of the convolution kernel have a smaller scale, and the weights on the outside be larger; this diffuses the concentration on the center out to the periphery. Practically, we can initialize the network with any initialization method, then scale the weights according to a distribution that has a lower scale at the center and higher scale on the outside.
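As a rough sketch of what such a re-scaling could look like (the paper does not give a concrete formula, so the linear radial profile and the scale range below are my own hypothetical choices), one could multiply an already-initialized kernel by a mask that grows with distance from the kernel center:

```python
import numpy as np

def rescale_kernel(kernel, low=0.5, high=1.5):
    """Hypothetical re-scaling: lower scale at the kernel center, higher outside.

    `kernel` is a (k, k) array produced by any standard initialization; the mask
    grows linearly with distance from the center, from `low` to `high`.
    """
    k = kernel.shape[0]
    c = (k - 1) / 2.0
    yy, xx = np.mgrid[0:k, 0:k]
    dist = np.sqrt((yy - c) ** 2 + (xx - c) ** 2)
    mask = low + (high - low) * dist / dist.max()
    return kernel * mask

w = np.random.randn(5, 5) * np.sqrt(2.0 / 25)   # e.g. a He-style draw for a 5x5 kernel
w_rescaled = rescale_kernel(w)
```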
In the extreme case, we can optimize the w(m)'s to maximize the ERF size or, equivalently, the variance in Eq. 10. Solving this optimization problem leads to the solution that puts weights equally at the 4 corners of the convolution kernel while leaving everywhere else 0. However, using this solution to do random weight initialization is too aggressive, and leaving a lot of weights at 0 makes learning slow. A softer version of this idea usually works better.

We have trained a CNN for the CIFAR-10 classification task with this initialization method, with several random seeds. In a few cases we get a 30% speed-up of training compared to the more standard initializations [5, 7]. But overall the benefit of this method is not always significant.

We note that no matter what we do to change w(m), the effective receptive field is still distributed like a Gaussian, so the above proposal only solves the problem partially.

Architectural changes. A potentially better approach is to make architectural changes to the CNNs, which may change the ERF in more fundamental ways. For example, instead of connecting each unit in a CNN to a local rectangular convolution window, we can sparsely connect each unit to a larger area in the lower layer using the same number of connections. Dilated convolution [21] belongs to this category, but we may push even further and use sparse connections that are not grid-like.

5 Discussion

Connection to biological neural networks. In our analysis we have established that the effective receptive field in deep CNNs actually grows a lot slower than we used to think. This indicates that a lot of local information is still preserved even after many convolution layers. This finding contradicts some long-held relevant notions about deep biological networks. A popular characterization of mammalian visual systems involves a split into "what" and "where" pathways [19]. Progressing along the what or where pathway, there is a gradual shift in the nature of connectivity: receptive field sizes increase, and spatial organization becomes looser until there is no obvious retinotopic organization; the loss of retinotopy means that single neurons respond to objects such as faces anywhere in the visual field [9]. However, if the ERF is smaller than the RF, this suggests that representations may retain position information, and also raises an interesting question concerning changes in the size of these fields during development.

A second relevant effect of our analysis is that it suggests that convolutional networks may automatically create a form of foveal representation. The fovea of the human retina extracts high-resolution information from an image only in the neighborhood of the central pixel. Sub-fields of equal resolution are arranged such that their size increases with the distance from the center of the fixation. At the periphery of the retina, lower-resolution information is extracted from larger regions of the image. Some neural networks have explicitly constructed representations of this form [11]. However, because convolutional networks form Gaussian receptive fields, the underlying representations will naturally have this character.

Connection to previous work on CNNs. While receptive fields in CNNs have not been studied extensively, [7, 5] conduct similar analyses, in terms of computing how the variance evolves through the networks. They developed a good initialization scheme for convolution layers following the principle that variance should not change much when going through the network.

Researchers have also utilized visualizations in order to understand how neural networks work. [14] showed the importance of using natural-image priors and also what an activation of the convolutional layer would represent. [22] used deconvolutional nets to show the relation of pixels in the image and the neurons that are firing. [23] did an empirical study involving receptive fields and used them as a cue for localization. There are also visualization studies using gradient ascent techniques [4] that generate interesting images, such as [15]. These all focus on the unit activations, or feature map, instead of the effective receptive field which we investigate here.

6 Conclusion

In this paper, we carefully studied the receptive fields in deep CNNs, and established a few surprising results about the effective receptive field size. In particular, we have shown that the distribution of impact within the receptive field is asymptotically Gaussian, and the effective receptive field only takes up a fraction of the full theoretical receptive field. Empirical results echoed the theory we established. We believe this is just the start of the study of the effective receptive field, which provides a new angle to understand deep CNNs. In the future we hope to study more about what factors impact the effective receptive field in practice and how we can gain more control over them.

References

[1] Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293, 2015.
[2] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. In NIPS, 2015.
[3] Steffen Eger. Restricted weighted integer compositions and extended binomial coefficients. Journal of Integer Sequences, 16(13.1):3, 2013.
[4] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341, 2009.
[5] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
[9] Nancy Kanwisher, Josh McDermott, and Marvin M Chun. The fusiform face area: a module in human extrastriate cortex specialized for face perception. The Journal of Neuroscience, 17(11):4302–4311, 1997.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[11] Hugo Larochelle and Geoffrey E Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, pages 1243–1251, 2010.
[12] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[13] L Lovász, J Pelikán, and K Vesztergombi. Discrete Mathematics: Elementary and Beyond, 2003.
[14] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In CVPR, pages 5188–5196. IEEE, 2015.
[15] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June 20, 2015.
[16] Thorsten Neuschel. A note on extended binomial coefficients. Journal of Integer Sequences, 17(2):3, 2014.
[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[19] Leslie G Ungerleider and James V Haxby. 'What' and 'where' in the human brain. Current Opinion in Neurobiology, 4(2):157–165, 1994.
[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
[21] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[22] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
[23] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856, 2014.
