ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation

Sachin Mehta1 [0000−0002−5420−4725], Mohammad Rastegari2, Anat Caspi1, Linda Shapiro1, and Hannaneh Hajishirzi1

1 University of Washington, Seattle, WA, USA
{sacmehta, caspian, shapiro, hannaneh}@cs.washington.edu
2 Allen Institute for AI and XNOR.AI, Seattle, WA, USA
[email protected]

Abstract. We introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively. Our code is open-source and available at https://sacmehta.github.io/ESPNet/.

1 Introduction

Deep convolutional neural network (CNN) models have achieved high accuracy in visual scene understanding tasks [1–3]. While the accuracy of these networks has improved with their increase in depth and width, large networks are slow and power hungry. This is especially problematic on the computationally heavy task of semantic segmentation [4–10]. For example, PSPNet [1] has 65.7 million parameters and runs at about 1 FPS while discharging the battery of a standard laptop at a rate of 77 Watts.
Many advanced real-world applications, such as self-driving cars, robots, and augmented reality, are sensitive and demand on-line processing of data locally on edge devices. These accurate networks require enormous resources and are not suitable for edge devices, which have limited energy overhead, restrictive memory constraints, and reduced computational capabilities.

Convolution factorization has demonstrated its success in reducing the computational complexity of deep CNNs [11–15]. We introduce an efficient convolutional module, ESP (efficient spatial pyramid), which is based on the convolutional factorization principle (Fig. 1). Based on these ESP modules, we introduce an efficient network structure, ESPNet, that can be easily deployed on resource-constrained edge devices. ESPNet is fast, small, low power, and low latency, yet still preserves segmentation accuracy.

[Figure 1: (a) ESP strategy: Reduce, Split, Transform, Merge. (b) Block diagram of the ESP module with HFF.]

Fig. 1: (a) The standard convolution layer is decomposed into point-wise convolution and spatial pyramid of dilated convolutions to build an efficient spatial pyramid (ESP) module. (b) Block diagram of ESP module. The large effective receptive field of the ESP module introduces gridding artifacts, which are removed using hierarchical feature fusion (HFF). A skip-connection between input and output is added to improve the information flow. See Section 3 for more details. Dilated convolutional layers are denoted as (# input channels, effective kernel size, # output channels). The effective spatial dimensions of a dilated convolutional kernel are $n_k \times n_k$, where $n_k = (n-1)2^{k-1}+1$, $k = 1, \cdots, K$. Note that only $n \times n$ pixels participate in the dilated convolutional kernel. In our experiments $n = 3$ and $d = \frac{M}{K}$.

ESP is based on a convolution factorization principle that decomposes a standard convolution into two steps: (1) point-wise convolutions and (2) spatial pyramid of dilated convolutions, as shown in Fig. 1.
The point-wise convolutions help in reducing the computation, while the spatial pyramid of dilated convolutions re-samples the feature maps to learn the representations from large effective receptive field. We show that our ESP module is more efficient than other factorized forms of convolutions, such as Inception [11–13] and ResNext [14]. Under the same constraints on memory and computation, ESPNet outperforms MobileNet [16] and ShuffleNet [17] (two other efficient networks that are built upon the factorization principle). We note that existing spatial pyramid methods (e.g. the atrous spatial pyramid module in [3]) are computationally expensive and cannot be used at different spatial levels for learning the representations. In contrast to these methods, ESP is computationally efficient and can be used at different spatial levels of a CNN network. Existing models based on dilated convolutions [1,3,18,19] are large and inefficient, but our ESP module generalizes the use of dilated convolutions in a novel and efficient way.

To analyze the performance of a CNN network on edge devices, we introduce several new performance metrics, such as sensitivity to GPU frequency and warp execution efficiency. To showcase the power of ESPNet, we evaluate our model on one of the most expensive tasks in AI and computer vision: semantic segmentation. ESPNet is empirically demonstrated to be more accurate, efficient, and fast than ENet [20], one of the most power-efficient semantic segmentation networks, while learning a similar number of parameters. Our results also show that ESPNet learns generalizable representations and outperforms ENet [20] and another efficient network ERFNet [21] on the unseen dataset. ESPNet can process a high resolution RGB image at a rate of 112, 21, and 9 frames per second on the NVIDIA TitanX, GTX-960M, and Jetson TX2 respectively.

2 Related Work

Different techniques, such as convolution factorization, network compression, and low-bit networks, have been proposed to speed up CNNs.
We first briefly describe these approaches and then provide a brief overview of CNN-based semantic segmentation.

Convolution factorization: Convolutional factorization decomposes the convolutional operation into multiple steps to reduce the computational complexity. This factorization has successfully shown its potential in reducing the computational complexity of deep CNN networks (e.g. Inception [11–13], factorized network [22], ResNext [14], Xception [15], and MobileNets [16]). ESP modules are also built on this factorization principle. The ESP module decomposes a convolutional layer into a point-wise convolution and spatial pyramid of dilated convolutions. This factorization helps in reducing the computational complexity, while simultaneously allowing the network to learn the representations from a large effective receptive field.

Network compression: Another approach for building efficient networks is compression. These methods use techniques such as hashing [23], pruning [24], vector quantization [25], and shrinking [26,27] to reduce the size of the pre-trained network.

Low-bit networks: Another approach towards efficient networks is low-bit networks, which quantize the weights to reduce the network size and complexity (e.g. [28–31]).

Sparse CNN: To remove the redundancy in CNNs, sparse CNN methods, such as sparse decomposition [32], structural sparsity learning [33], and dictionary-based method [34], have been proposed.

We note that compression-based methods, low-bit networks, and sparse CNN methods are equally applicable to ESPNets and are complementary to our work.

Dilated convolution: Dilated convolutions [35] are a special form of standard convolutions in which the effective receptive field of kernels is increased by inserting zeros (or holes) between each pixel in the convolutional kernel. For an $n \times n$ dilated convolutional kernel with a dilation rate of $r$, the effective size of the kernel is $[(n-1)r+1]^2$. The dilation rate specifies the number of zeros (or holes) between pixels. However, due to dilation, only $n \times n$ pixels participate in the convolutional operation, reducing the computational cost while increasing the effective kernel size.
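The effective-size formula above can be checked with a small sketch. The helper names `effective_kernel_size` and `dilate_kernel` are hypothetical and not taken from the paper or its code; the sketch assumes that, under the formula $(n-1)r+1$, neighbouring kernel taps sit $r$ positions apart, so $r-1$ zeros separate them.

```python
def effective_kernel_size(n, r):
    """Side length of an n x n kernel dilated with rate r: (n - 1) * r + 1."""
    return (n - 1) * r + 1


def dilate_kernel(kernel, r):
    """Spread the taps of a square kernel (list of lists) r positions apart.

    Assumption for illustration: r - 1 zeros are placed between neighbours,
    which is what the (n - 1) * r + 1 effective size implies.
    """
    n = len(kernel)
    size = effective_kernel_size(n, r)
    dilated = [[0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            dilated[i * r][j * r] = kernel[i][j]
    return dilated


kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dilated = dilate_kernel(kernel, r=2)
for row in dilated:
    print(row)

# Only the original n * n taps are non-zero, whatever the dilation rate.
nonzero = sum(1 for row in dilated for value in row if value != 0)
print(len(dilated), nonzero)
```

For a 3×3 kernel with r = 2 the dilated kernel covers a 5×5 window while still holding only nine weights, which is the cost-versus-coverage trade the paragraph describes.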
Yu and Koltun [18] stacked dilated convolution layers with increasing dilation rate to learn contextual representations from a large effective receptive field. A similar strategy was adopted in [19,36,37]. Chen et al. [3] introduced an atrous spatial pyramid (ASP) module. This module can be viewed as a parallelized version of [3]. These modules are computationally inefficient (e.g. ASPs have high memory requirements and learn many more parameters; see Section 3.2). Our ESP module also learns multi-scale representations using dilated convolutions in parallel; however, it is computationally efficient and can be used at any spatial level of a CNN network.

CNN for semantic segmentation: Different CNN-based segmentation networks have been proposed, such as multi-dimensional recurrent neural networks [38], encoder-decoders [20,21,39,40], hypercolumns [41], region-based representations [42,43], and cascaded networks [44]. Several supporting techniques along with these networks have been used for achieving high accuracy, including ensembling features [3], multi-stage training [45], additional training data from other datasets [1,3], object proposals [46], CRF-based post processing [3], and pyramid-based feature re-sampling [1–3].

Encoder-decoder networks: Our work is related to this line of work. The encoder-decoder networks first learn the representations by performing convolutional and down-sampling operations. These representations are then decoded by performing up-sampling and convolutional operations. ESPNet first learns the encoder and then attaches a light-weight decoder to produce the segmentation mask. This is in contrast to existing networks where the decoder is either an exact replica of the encoder (e.g. [39]) or is relatively small (but not light weight) in comparison to the encoder (e.g. [20,21]).

Feature re-sampling methods: The feature re-sampling methods re-sample the convolutional feature maps at the same scale using different pooling rates [1,2] and kernel sizes [3] for efficient classification.
Feature re-sampling is computationally expensive and is performed just before the classification layer to learn scale-invariant representations. We introduce a computationally efficient convolutional module that allows feature re-sampling at different spatial levels of a CNN network.

3 ESPNet

We describe ESPNet and its core ESP module. We compare ESP modules with similar CNN modules, Inception [11–13], ResNext [14], MobileNet [16], and ShuffleNet [17].

3.1 ESP module

ESPNet is based on efficient spatial pyramid (ESP) modules, a factorized form of convolutions that decompose a standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions (see Fig. 1a). The point-wise convolution applies a $1 \times 1$ convolution to project high-dimensional feature maps onto a low-dimensional space. The spatial pyramid of dilated convolutions then re-samples these low-dimensional feature maps using $K$, $n \times n$ dilated convolutional kernels simultaneously, each with a dilation rate of $2^{k-1}$, $k = \{1, \cdots, K\}$. This factorization drastically reduces the number of parameters and the memory required by the ESP module, while preserving a large effective receptive field $\left[(n-1)2^{K-1}+1\right]^2$. This pyramidal convolutional operation is called a spatial pyramid of dilated convolutions, because each dilated convolutional kernel learns weights with different receptive fields and so resembles a spatial pyramid.

A standard convolutional layer takes an input feature map $F_i \in \mathbb{R}^{W \times H \times M}$ and applies $N$ kernels $\mathbf{K} \in \mathbb{R}^{m \times n \times M}$ to produce an output feature map $F_o \in \mathbb{R}^{W \times H \times N}$, where $W$ and $H$ represent the width and height of the feature map, $m$ and $n$ represent the width and height of the kernel, and $M$ and $N$ represent the number of input and output feature channels. For simplicity, we will assume that $m = n$. A standard convolutional kernel thus learns $n^2MN$ parameters. These parameters are multiplicatively dependent on the spatial dimensions of the $n \times n$ kernel and the number of input $M$ and output $N$ channels.
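As a rough illustration of the two quantities just introduced, the sketch below lists the dilation rates $2^{k-1}$ of the pyramid branches with their effective kernel sizes, and counts the $n^2MN$ parameters of a standard convolution. The function names are hypothetical, and the values $n = 3$, $K = 5$, $M = N = 128$ are chosen only for illustration.

```python
def pyramid_dilation_rates(K):
    """Dilation rates 2^(k-1) for the pyramid branches k = 1..K."""
    return [2 ** (k - 1) for k in range(1, K + 1)]


def effective_size(n, rate):
    """Effective side length of an n x n kernel at the given dilation rate."""
    return (n - 1) * rate + 1


def standard_conv_params(n, M, N):
    """Parameters of a standard n x n convolution with M inputs and N outputs."""
    return n * n * M * N


n, K = 3, 5
rates = pyramid_dilation_rates(K)
sizes = [effective_size(n, rate) for rate in rates]
print(rates)   # [1, 2, 4, 8, 16]
print(sizes)   # [3, 5, 9, 17, 33]
print(standard_conv_params(3, 128, 128))   # 147456
```

The largest branch sets the receptive field of the whole pyramid, while the parameter count of a standard layer grows with the product of kernel area and both channel counts.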
Width divider K: To reduce the computational cost, we introduce a simple hyper-parameter $K$. The role of $K$ is to shrink the dimensionality of the feature maps uniformly across each ESP module in the network. Reduce: For a given $K$, the ESP module first reduces the feature maps from $M$-dimensional space to $\frac{N}{K}$-dimensional space using a point-wise convolution (Step 1 in Fig. 1a). Split: The low-dimensional feature maps are split across $K$ parallel branches. Transform: Each branch then processes these feature maps simultaneously using $n \times n$ dilated convolutional kernels with different dilation rates given by $2^{k-1}$, $k = \{1, \cdots, K\}$ (Step 2 in Fig. 1a). Merge: The outputs of the $K$ parallel dilated convolutional kernels are concatenated to produce an $N$-dimensional output feature map. Fig. 1b visualizes the reduce-split-transform-merge strategy.

[Figure 2: (a) gridding artifact example. (b) RGB input and feature maps without HFF and with HFF.]

Fig. 2: (a) An example illustrating a gridding artifact with a single active pixel (red) convolved with a $3 \times 3$ dilated convolutional kernel with dilation rate $r = 2$. (b) Visualization of feature maps of ESP modules with and without hierarchical feature fusion (HFF). HFF in ESP eliminates the gridding artifact. Best viewed in color.

The ESP module has $\frac{NM + (Nn)^2}{K}$ parameters and its effective receptive field is $\left((n-1)2^{K-1}+1\right)^2$. Compared to the $n^2NM$ parameters of the standard convolution, factorizing it reduces the number of parameters by a factor of $\frac{n^2MK}{M+n^2N}$, while increasing the effective receptive field by $\sim(2^{K-1})^2$. For example, the ESP module learns $\sim 3.6\times$ fewer parameters with an effective receptive field of $17 \times 17$ than a standard convolutional kernel with an effective receptive field of $3 \times 3$ for $n = 3$, $N = M = 128$, and $K = 4$.
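The worked example above can be reproduced by counting parameters step by step. This is a minimal sketch with hypothetical function names; it assumes $N$ is divisible by $K$ and counts weights only (no biases or normalization parameters).

```python
def esp_params(n, M, N, K):
    """Parameters of an ESP module, counted per step.

    Reduce: a 1 x 1 convolution from M to d = N / K channels.
    Transform: K dilated n x n convolutions, each from d to d channels.
    Assumes N is divisible by K; biases are ignored.
    """
    d = N // K
    reduce_step = M * d
    transform_step = K * n * n * d * d
    return reduce_step + transform_step


def standard_conv_params(n, M, N):
    """Parameters of a standard n x n convolution with M inputs and N outputs."""
    return n * n * M * N


def esp_receptive_field(n, K):
    """Side length of the largest branch: (n - 1) * 2^(K-1) + 1."""
    return (n - 1) * 2 ** (K - 1) + 1


n, M, N, K = 3, 128, 128, 4
esp = esp_params(n, M, N, K)
std = standard_conv_params(n, M, N)
reduction = std / esp
print(esp, std, reduction)          # 40960 147456 3.6
print(esp_receptive_field(n, K))    # 17
```

The per-step count agrees with the closed form $(NM + (Nn)^2)/K$, and the ratio matches $n^2MK/(M+n^2N) = 3.6$ for these values.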
Hierarchical feature fusion (HFF) for de-gridding: While concatenating the outputs of dilated convolutions gives the ESP module a large effective receptive field, it introduces unwanted checkerboard or gridding artifacts, as shown in Fig. 2. To address the gridding artifact in ESP, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them (HFF in Fig. 1b). This simple, effective solution does not increase the complexity of the ESP module, in contrast to existing methods that remove the gridding artifact by learning more parameters using dilated convolutional kernels [19,37]. To improve gradient flow inside the network, the input and output feature maps are combined using an element-wise sum [47].

3.2 Relationship with other CNN modules

The ESP module shares similarities with the following CNN modules.

MobileNet module: The MobileNet module [16], shown in Fig. 3a, uses a depth-wise separable convolution [15] that factorizes a standard convolution into depth-wise convolutions (transform) and point-wise convolutions (expand). It learns fewer parameters, but has a higher memory requirement and a smaller receptive field than the ESP module. An extreme version of the ESP module (with $K = N$) is almost identical to the MobileNet module, differing only in the order of convolutional operations. In the MobileNet module, the spatial convolutions are followed by point-wise convolutions; however, in the ESP module, point-wise convolutions are followed by spatial convolutions.

[Figure 3: panels (a) MobileNet, (b) ShuffleNet, (c) Inception, (d) ResNext, (e) ASP, (f) Comparison between different modules. Convolution types shown: depth-wise, grouped, standard.]

(f) Comparison between different modules:

| Module | # Parameters | Memory (in MB) | Effective receptive field |
| MobileNet | $M(n^2+N) = 11{,}009$ | $(M+N)WH = 2.39$ | $[n]^2 = 3 \times 3$ |
| ShuffleNet | $\frac{d}{g}(M+N)+n^2d = 2{,}180$ | $WH(2d+N) = 1.67$ | $[n]^2 = 3 \times 3$ |
| Inception | $K(Md+n^2d^2) = 28{,}000$ | $2KWHd = 2.39$ | $[n]^2 = 3 \times 3$ |
| ResNext | $K(Md+d^2n^2+dN) = 38{,}000$ | $KWH(2d+N) = 8.37$ | $[n]^2 = 3 \times 3$ |
| ASP | $KMNn^2 = 450{,}000$ | $KWHN = 5.98$ | $\left[(n-1)2^{K-1}+1\right]^2 = 33 \times 33$ |
| ESP (Fig. 1b) | $Md+Kn^2d^2 = 20{,}000$ | $WHd(K+1) = 1.43$ | $\left[(n-1)2^{K-1}+1\right]^2 = 33 \times 33$ |

Here, $M = N = 100$, $n = 3$, $K = 5$, $d = \frac{N}{K} = 20$, $g = 2$, and $W = H = 56$.

Fig. 3: Different types of convolutional modules for comparison. We denote the layer as (# input channels, kernel size, # output channels). Dilation rate in (e) is indicated on top of each layer. Here, $g$ represents the number of convolutional groups in grouped convolution [48]. For simplicity, we only report the memory of convolutional layers in (d). For converting the required memory to bytes, we multiply it by 4 (1 float requires 4 bytes for storage).

ShuffleNet module: The ShuffleNet module [17], shown in Fig. 3b, is based on the principle of reduce-transform-expand. It is an optimized version of the bottleneck block in ResNet [47]. To reduce computation, ShuffleNet makes use of grouped convolutions [48] and depth-wise convolutions [15]. It replaces $1 \times 1$ and $3 \times 3$ convolutions in the bottleneck block in ResNet with $1 \times 1$ grouped convolutions and $3 \times 3$ depth-wise separable convolutions, respectively. The ShuffleNet module learns many fewer parameters than the ESP module, but has higher memory requirements and a smaller receptive field.

Inception module: Inception modules [11–13] are built on the principle of split-reduce-transform-merge and are usually heterogeneous in number of channels and kernel size (e.g. some of the modules are composed of standard and factored convolutions). In contrast, ESP modules are straightforward and simple to design. For the sake of comparison, the homogeneous version of an Inception module is shown in Fig. 3c. Fig. 3f compares the Inception module with the ESP module. ESP (1) learns fewer parameters, (2) has a low memory requirement, and (3) has a larger effective receptive field.

ResNext module: A ResNext module [14], shown in Fig. 3d, is a parallel version of the bottleneck module in ResNet [47], based on the principle of split-reduce-transform-expand-merge.
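The comparison in Fig. 3f can be re-derived by evaluating its formulas at the stated setting ($M = N = 100$, $n = 3$, $K = 5$, $d = 20$, $g = 2$, $W = H = 56$). The sketch below does so; the dictionary layout and the `to_mb` helper are illustrative choices, and it assumes 4 bytes per float and 1 MB = $2^{20}$ bytes, which is the convention that appears to reproduce the reported memory figures. Evaluating the formulas as printed matches the tabulated parameter counts for every row except MobileNet, where $M(n^2+N)$ gives 10,900 rather than the tabulated 11,009.

```python
M = N = 100
n, K, g = 3, 5, 2
d = N // K          # 20 channels per branch
W = H = 56

# Parameter formulas exactly as listed in Fig. 3f.
params = {
    "MobileNet": M * (n * n + N),
    "ShuffleNet": (d // g) * (M + N) + n * n * d,
    "Inception": K * (M * d + n * n * d * d),
    "ResNext": K * (M * d + d * d * n * n + d * N),
    "ASP": K * M * N * n * n,
    "ESP": M * d + K * n * n * d * d,
}

# Memory formulas from Fig. 3f, expressed as a number of floats.
memory_floats = {
    "MobileNet": (M + N) * W * H,
    "ShuffleNet": W * H * (2 * d + N),
    "Inception": 2 * K * W * H * d,
    "ResNext": K * W * H * (2 * d + N),
    "ASP": K * W * H * N,
    "ESP": W * H * d * (K + 1),
}


def to_mb(num_floats):
    """Convert a float count to MB, assuming 4-byte floats and 1 MB = 2**20 bytes."""
    return num_floats * 4 / 2 ** 20


for name in params:
    print(f"{name:10s} params={params[name]:>7,d} memory={to_mb(memory_floats[name]):.4f} MB")
```

Under these assumptions the ESP module needs roughly a quarter of the memory of ASP at the same receptive field, and less than every other module in the comparison.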
The ESP module is similar in branching and residual summation, but more efficient in memory and parameters with a larger effective receptive field.

Atrous spatial pyramid (ASP) module: An ASP module [3], shown in Fig. 3e, is built on the principle of split-transform-merge. The ASP module involves branching with each branch learning kernel at a different receptive field (using dilated convolutions). Though ASP modules tend to perform well in segmentation tasks due to their high effective receptive fields, ASP modules have high memory requirements and learn many more parameters. Unlike the ASP module, the ESP module is computationally efficient.

4 Experiments

To showcase the power of ESPNet, we evaluate ESPNet's performance on several semantic segmentation datasets and compare to the state-of-the-art networks.

4.1 Experimental set-up

Network structure: ESPNet uses ESP modules for learning convolutional kernels as well as down-sampling operations, except for the first layer: a standard strided convolution. All layers are followed by a batch normalization [49] and a PReLU [50] non-linearity except the last point-wise convolution, which has neither batch normalization nor non-linearity. The last layer feeds into a softmax for pixel-wise classification.

Different variants of ESPNet are shown in Fig. 4. The first variant, ESPNet-A (Fig. 4a), is a standard network that takes an RGB image as an input and learns representations at different spatial levels (see footnote 3) using the ESP module to produce a segmentation mask. The second variant, ESPNet-B (Fig. 4b), improves the flow of information inside ESPNet-A by sharing the feature maps between the previous strided ESP module and the previous ESP module. The third variant, ESPNet-C (Fig. 4c), reinforces the input image inside ESPNet-B to further improve the flow of information.
These three variants produce outputs whose spatial dimensions are $\frac{1}{8}$th of the input image. The fourth variant, ESPNet (Fig. 4d), adds a light weight decoder (built using a principle of reduce-upsample-merge) to ESPNet-C that outputs the segmentation mask of the same spatial resolution as the input image.

To build deeper computationally efficient networks for edge devices without changing the network topology, a hyper-parameter $\alpha$ controls the depth of the network; the ESP module is repeated $\alpha_l$ times at spatial level $l$. CNNs require more memory at higher spatial levels (at $l = 0$ and $l = 1$) because of the high spatial dimensions of feature maps at these levels. To be memory efficient, neither the ESP nor the convolutional modules are repeated at these spatial levels.

Footnote 3: At each spatial level $l$, the spatial dimensions of the feature maps are the same. To learn representations at different spatial levels, a down-sampling operation is performed (see Fig. 4a).

[Figure 4: panels (a) ESPNet-A, (b) ESPNet-B, (c) ESPNet-C, (d) ESPNet.]

Fig. 4: The path from ESPNet-A to ESPNet. Red and green color boxes represent the modules responsible for down-sampling and up-sampling operations, respectively. Spatial level $l$ is indicated on the left of every module in (a). We denote each module as (# input channels, # output channels). Here, Conv-n represents $n \times n$ convolution.

Dataset: We evaluated the ESPNet on the Cityscapes dataset [6], an urban visual scene-understanding dataset that consists of 2,975 training, 500 validation, and 1,525 test high-resolution images. The task is to segment an image into 19 classes belonging to 7 categories (e.g. person and rider classes belong to the same category human). We evaluated our networks on the test set using the Cityscapes online server.

To study the generalizability, we tested the ESPNet on an unseen dataset. We used the Mapillary dataset [51] for this task because of its diversity. We mapped the annotations (65 classes) in the validation set (2,000 images) to seven categories in the Cityscapes dataset. To further study the segmentation power of our model, we trained and tested the ESPNet on two other popular datasets from different domains. First, we used the widely known PASCAL VOC dataset [52] that has 1,464 training images, 1,448 validation images, and 1,456 test images. The task is to segment an image into 20 foreground classes. We evaluate our networks on the test set (comp6 category) using the PASCAL VOC online server. Following the convention, we used additional images from [53,54]. Secondly, we used a breast biopsy whole slide image dataset [36], chosen because tissue structures in biomedical images vary in size and shape and because this dataset allowed us to check the potential of learning representations from a large receptive field. The dataset consists of 30 training images and 28 validation images, whose average size is $10{,}000 \times 12{,}000$ pixels, much larger than natural scene images. The task is to segment the images into 8 biological tissue labels; details are in [36].

Performance evaluation metrics: Most traditional CNNs measure network performance in terms of accuracy, latency, network parameters, and network size [16,17,20,21,55]. These metrics provide high-level insight about the network, but fail to demonstrate the efficient usage of hardware resources with limited availability. In addition to these metrics, we introduce several system-level metrics to characterize the performance of a CNN on resource-constrained devices [56,57].

Segmentation accuracy is measured as a mean Intersection over Union (mIOU) score between the ground truth and the predicted segmentation mask.
Latency represents the amount of time a CNN network takes to process an image. This is usually measured in terms of frames per second (FPS).

Network parameters represents the number of parameters learned by the network.

Network size represents the amount of storage space required to store the network parameters. An efficient network should have a smaller network size.

Power consumption is the average power consumed by the network during inference.

Sensitivity to GPU frequency measures the computational capability of an application and is defined as a ratio of percentage change in execution time to the percentage change in GPU frequency. Higher values indicate higher efficiency.

Utilization rates measure the utilization of compute resources (CPU, GPU, and memory) while running on an edge device. In particular, computing units in edge devices (e.g. Jetson TX2) share memory between CPU and GPU.

Warp execution efficiency is defined as the average percentage of active threads in each executed warp. GPUs schedule threads as warps; each thread is executed in single instruction multiple data fashion. Higher values represent efficient usage of GPU.

Memory efficiency is the ratio of number of bytes requested/stored to the number of bytes transferred from/to device (or shared) memory to satisfy load/store requests. Since memory transactions are in blocks, this metric measures memory bandwidth efficiency.

Training details: ESPNet networks were trained using PyTorch [58] with CUDA 9.0 and cuDNN back-ends. ADAM [59] was used with an initial learning rate of 0.0005, decayed by a factor of two after every 100 epochs, and a weight decay of 0.0005. An inverse class probability weighting scheme was used in the cross-entropy loss function to address the class imbalance [20,21]. Following [20,21], the weights were initialized randomly.
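The step-wise learning-rate decay and the class weighting mentioned in the training details can be sketched as follows. The function names and the example pixel counts are hypothetical. The weighting shown is the plain inverse of each class's pixel frequency, which is an assumption made for illustration: the text does not give the exact formula, and the schemes in [20,21] may differ. Every class is assumed to have a non-zero pixel count.

```python
def learning_rate(epoch, base_lr=0.0005, step=100, factor=2.0):
    """Learning rate after dividing base_lr by `factor` once every `step` epochs."""
    return base_lr / factor ** (epoch // step)


def inverse_probability_weights(pixel_counts):
    """Per-class loss weights w_c = 1 / p_c, where p_c = count_c / total.

    Illustrative assumption: plain inverse frequency, all counts > 0.
    """
    total = sum(pixel_counts)
    return [total / count for count in pixel_counts]


# Learning rate at a few epochs: constant within each 100-epoch window.
for epoch in (0, 99, 100, 250):
    print(epoch, learning_rate(epoch))

# Hypothetical pixel counts for three classes, from frequent to rare.
counts = [800, 150, 50]
weights = inverse_probability_weights(counts)
print(weights)
```

Rare classes receive proportionally larger weights, so errors on them contribute more to the cross-entropy loss than errors on frequent classes.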
Standard strategies, such as scaling, cropping and flipping, were used to augment the data. The image resolution in the Cityscapes dataset is $2048 \times 1024$, and all the accuracy results were reported at this resolution. For training the networks, we sub-sampled the RGB images by two. When the output resolution was smaller than $2048 \times 1024$, the output was up-sampled using bi-linear interpolation. For training on the PASCAL dataset, we used a fixed image size of $512 \times 512$. For the WSI dataset, the patch-wise training approach was followed [36]. ESPNet was trained in two stages. First, ESPNet-C was trained with down-sampled annotations. Second, a light-weight decoder was attached to ESPNet-C and then the entire ESPNet network was trained.

Three different GPU devices were used for our experiments: (1) a desktop with an NVIDIA TitanX GPU (3,584 CUDA cores), (2) a laptop with an NVIDIA GTX-960M GPU (640 CUDA cores), and (3) an edge device with an NVIDIA Jetson TX2 (256 CUDA cores). Unless otherwise stated, statistics are reported for an RGB image of size $1024 \times 512$ averaged over 200 trials. For collecting the hardware-level statistics, NVIDIA's and Intel's hardware profiling and tracing tools, such as NVPROF [60], Tegrastats [61], and PowerTop [62], were used. In our experiments, we will refer to ESPNet with $\alpha_2 = 2$ and $\alpha_3 = 8$ as ESPNet unless otherwise stated.

4.2 Segmentation results on the Cityscapes dataset

Comparison with efficient convolutional modules: In order to understand the ESP module, we replaced the ESP modules in ESPNet-C with state-of-the-art efficient convolutional modules, sketched in Fig. 3 (MobileNet [16], ShuffleNet [17], Inception [11–13], ResNext [14], and ResNet [47]) and evaluated their performance on the Cityscapes validation dataset. We did not compare with ASP [3], because it is computationally expensive and not suitable for edge devices. Fig. 5 compares the performance of ESPNet-C with different convolutional modules. Our ESP module outperformed MobileNet and ShuffleNet modules by 7% and 12%, respectively, while learning a similar number of parameters and having comparable network size and inference speed. Furthermore, the ESP module delivered comparable accuracy to ResNext and Inception more efficiently. A basic ResNet module (stack of two $3 \times 3$ convolutions with a skip-connection) delivered the best performance, but had to learn $6.5\times$ more parameters.

[Figure 5: panels (a) Accuracy vs. network size, (b) Accuracy vs. speed (laptop).]

Fig. 5: Comparison between state-of-the-art efficient convolutional modules. For a fair comparison between different modules, we used $K = 5$, $d = \frac{N}{K}$, $\alpha_2 = 2$, and $\alpha_3 = 3$. We used standard strided convolution for down-sampling. For ShuffleNet, we used $g = 4$ and $K = 4$ so that the resultant ESPNet-C network has the same complexity as with the ESP block.
Comparison with segmentation methods: We compared the performance of ESPNet with state-of-the-art semantic segmentation networks. These networks either use a pre-trained network (VGG [63]: FCN-8s [45] and SegNet [39], ResNet [47]: DeepLab-v2 [3] and PSPNet [1], and SqueezeNet [55]: SQNet [64]) or were trained from scratch (ENet [20] and ERFNet [21]). ESPNet is 2% more accurate than ENet [20], while running $1.27\times$ and $1.16\times$ faster on a desktop and a laptop, respectively (Fig. 6). ESPNet makes some mistakes between classes that belong to the same category, and hence has a lower class-wise accuracy. For example, a rider can be confused with a person. However, ESPNet delivers a good category-wise accuracy. ESPNet had 8% lower category-wise mIOU than PSPNet [1], while learning $180\times$ fewer parameters. ESPNet had lower power consumption, had a lower battery discharge rate, and was significantly faster than state-of-the-art methods, while still achieving a competitive category-wise accuracy; this makes ESPNet suitable for segmentation on edge devices. ERFNet, another efficient segmentation network, delivered good segmentation accuracy, but has $5.5\times$ more parameters, is $5.44\times$ larger, consumes more power, and has a higher battery discharge rate than ESPNet. Also, ERFNet does not utilize limited available hardware resources efficiently on edge devices (Section 4.4).

4.3 Segmentation results on other datasets

Unseen dataset: Table 1a compares the performance of ESPNet with ENet [20] and ERFNet [21] on an unseen dataset. These networks were trained on the Cityscapes dataset [6] and tested on the Mapillary (unseen) dataset [51]. ENet and ERFNet were chosen due to the efficiency and power of ENet and the high accuracy of ERFNet. Our experiments show that ESPNet learns good generalizable representations of objects and outperforms ENet and ERFNet on the unseen dataset.

PASCAL VOC 2012 dataset: (Table 1c) On the PASCAL dataset, ESPNet is 4% more accurate than SegNet, one of the smallest networks on the PASCAL VOC, while learning $81\times$ fewer parameters. ESPNet is 22% less accurate than PSPNet (one of the most accurate networks on the PASCAL VOC) while learning $180\times$ fewer parameters.
Breast biopsy dataset: (Table 1d) On the breast biopsy dataset, ESPNet achieved the same accuracy as [36] while learning $9.5\times$ fewer parameters.
