Learning Multi-level Region Consistency with Dense Multi-label Networks for Semantic Segmentation

Tong Shen*, Guosheng Lin*, Chunhua Shen, Ian Reid
School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia
e-mail: [email protected]

arXiv:1701.07122v1 [cs.CV] 25 Jan 2017

Abstract

Semantic image segmentation is a fundamental task in image understanding. Per-pixel semantic labelling of an image benefits greatly from the ability to consider region consistency both locally and globally. However, many Fully Convolutional Network based methods do not impose such consistency, which may give rise to noisy and implausible predictions. We address this issue by proposing a dense multi-label network module that is able to encourage region consistency at different levels. This simple but effective module can be easily integrated into any semantic segmentation system. With comprehensive experiments, we show that dense multi-label can successfully remove implausible labels and clear confusion so as to boost the performance of semantic segmentation systems.

Contents

1. Introduction
2. Related Work
3. Methods
   3.1. Dense Multi-label
   3.2. Overview of Framework
   3.3. Dense Multi-label Block
   3.4. Ground Truth Generation
   3.5. Network Configuration
4. Experiments and Analysis
   4.1. Results on ADE20k dataset
   4.2. Results on PASCAL-Context
   4.3. Results on NYUDv2
   4.4. Results on SUN-RGBD
   4.5. Ablation Study on PASCAL-Context
   4.6. Failure Analysis
5. Conclusion

* The first two authors contributed equally. Correspondence should be addressed to C. Shen.
1. Introduction

Semantic segmentation is one of the fundamental problems in computer vision, whose task is to assign a semantic label to each pixel of an image so that different classes can be distinguished. This topic has been widely studied [1, 2, 3, 4, 5, 6]. Among these models, Fully Convolutional Network (FCN) based models have become dominant [7, 8, 9, 10, 11, 12]. These models are simple and effective because of the powerful capacity of Convolutional Neural Networks (CNNs) and their ability to be trained end-to-end. However, most existing methods have no mechanism to enforce region consistency, which plays an important role in semantic segmentation. Consider, for example, Figure 1, in which the lower-left image is the output of a vanilla FCN, whose prediction contains some noisy labels that do not appear in the ground truth. With enforced region consistency, we can simply eliminate those implausible labels and clear the confusion. Our aim in this work is to introduce constraints that encourage this consistency.

Figure 1. Illustration of region consistency. For a region in the input image, coloured in red, the corresponding part of the ground truth contains only three classes. In the network without region consistency, five classes appear. If we explicitly encourage the consistency, those unlikely classes are eliminated and the prediction improves, as shown on top.

Our proposal is both simple and effective: we argue that the consistency within a certain region can be formulated as a multi-label classification problem. Multi-label classification has also been widely studied [13, 14, 15, 16, 17]; its task is to assign one or more labels to an image. By performing multi-label classification in a region, we can allow the data to suggest which labels are likely within the broad context of the region, and use this information to suppress implausible classes predicted without reference to the broader context, thereby improving scene consistency. While typical multi-label problems are formulated as whole-image inference, we adapt this approach to dense prediction problems such as semantic segmentation by introducing dense multi-label prediction for image regions of various sizes.

Dense multi-label prediction is performed in a sliding-window fashion: the classification for each spatial point is influenced by the network prediction and by the multi-label result for the surrounding window. By employing different window sizes, we are able to construct a multi-level structure for dense multi-label and enforce region consistency at different levels, both locally and globally. Figure 2 illustrates dense multi-label at multiple window sizes. Here we use three windows of different sizes. The red window, the smallest, focuses more on local region consistency, while the green window, the largest, is responsible for global region consistency. The other one, in blue, is for mid-level consistency. By sliding the windows to consider each spatial point, we perform multi-label classification densely at different levels, encouraging the segmentation predictor to give predictions that are consistent with the dense multi-label prediction.

Figure 2. Illustration of dense multi-label with multiple levels. There are three windows of different sizes. The red window, the smallest, focuses more on local region consistency, while the green window is responsible for global region consistency. The other one, in blue, is for mid-level consistency. By sliding the windows, we can perform multi-label classification densely for each spatial point.

Our contributions are as follows:

• We address the problem of region consistency in semantic segmentation by proposing a dense multi-label module to retain region consistency, which is simple and effective. We also introduce a multi-level structure for dense multi-label to preserve region consistency both locally and globally.

• We evaluate our method on four popular semantic segmentation datasets, namely NYUDv2, SUN-RGBD, PASCAL-Context and ADE20k, and achieve promising results. We also give an analysis of how dense multi-label can remove implausible labels, clear confusion and effectively boost segmentation systems.

This paper is organized as follows. We first review related work in Section 2. We then explain dense multi-label and give an overview of our structure in Section 3. In Section 4, we present comprehensive experiments and analyze the results. Finally, we draw conclusions and discuss future work in Section 5.

2. Related Work

Semantic segmentation has been widely studied [1, 2, 3, 4, 5, 6]. Early CNN-based methods rely on region proposals or superpixels; they make segmentation predictions by classifying these local features.

More recently, with Long et al. [18] introducing the application of Fully Convolutional Networks (FCNs) to semantic segmentation, FCN-based segmentation models [7, 8, 9, 10, 11, 12] have become popular. In [18], Long et al. convert the last fully connected layers into convolutional layers, thus allowing the CNN to accept arbitrary input sizes. Since the output retains spatial information, it is straightforward to train the network jointly in an end-to-end fashion. They also introduce a skip architecture to combine features from different levels. Chen et al. [10] modify the original FCN by introducing dilated kernels, in which kernels are inserted with zeros, to enable a large field of view, and use a fully connected CRF to refine the outputs. Lin et al. [11] introduce a joint training model with CRFs. In that work, CRFs are not simply used for smoothness as in [10], but as a more general term that learns context information to help boost the unary performance. Liu et al. [19] utilise global features to improve semantic segmentation: they extract global features from different levels and fuse them using an L2 normalization layer. Our method is different from these; we attempt to improve segmentation performance by enforcing region consistency using dense multi-label.

Multi-label classification has also been widely studied. Traditional methods are based on graphical models [20, 16], while recent studies benefit more from CNNs [13, 14, 17]. Gong et al. [17] transform a single-label classification model into a multi-label classification model and use a ranking loss for training. Wei et al. [13] also use transfer learning from single-label classification models; they perform multi-label classification by first generating object hypotheses and then fusing the predictions into the final prediction for the whole image. Jiang et al. [14] propose a unified framework for multi-label classification using a CNN and a Recurrent Neural Network (RNN).

Here we propose a dense multi-label module to take advantage of multi-label classification and integrate it into semantic segmentation systems. Dense multi-label is performed in a sliding-window fashion and treats the whole area in a window as a multi-label classification problem. Experiments show that dense multi-label can help keep the scene consistent, clear confusion and boost the performance of semantic segmentation.

3. Methods

3.1. Dense Multi-label

Multi-label classification is a task where each image can have more than one label, unlike a multi-class classification problem [21, 22, 23, 24] whose goal is to assign only one label to the image. This is more natural in reality because, in the majority of images, objects are not isolated; instead, they appear in context with other objects or the scene. Multi-label classification therefore gives us more information about the image.

A dense prediction task such as segmentation treats every spatial point as a multi-class classification problem, where each point is assigned one of the categories. As shown in the upper part of Figure 3, the model predicts scores for each class and picks the highest one. The ground truth is correspondingly a one-hot vector. For a dense multi-label problem, each spatial point is instead assigned several labels indicating which labels appear in a certain window centered at this point. As shown in the lower part of Figure 3, two classes are predicted with high confidence, and the ground truth is given by a "multiple-hot" vector.

Here we propose a method to learn a dense multi-label system and a segmentation system at the same time. We aim to use dense multi-label to suppress implausible classes and encourage appropriate classes so as to retain region consistency for the segmentation prediction both globally and locally. In the next section, more details of the whole framework are provided.

3.2. Overview of Framework

An overview of the structure is shown in Figure 4, with the part in the dashed-line rectangle being the dense multi-label module. Without it, the network simply becomes an FCN. The input image is first fed into several low-level feature layers which are shared by the following blocks. Then, apart from going into the segmentation block, the features also enter three blocks for dense multi-label prediction. The outputs of these blocks are merged element-wise to make the final prediction.

In the training phase, the network is guided by four loss functions: the segmentation loss and three dense multi-label losses. We use a softmax loss for the segmentation path, and a logistic loss for all the dense multi-label blocks.

The dense multi-label blocks have different window sizes for performing dense multi-label prediction within different contexts. With this multi-level structure, we are able to retain region consistency both locally and globally.

Let x denote the image. The process of the low-level feature block can be described as

    o = f_{low}(x; \theta_{low}),    (1)

where o is the output and \theta_{low} the layer parameters. The dense multi-label blocks and the segmentation block are defined as

    m^{(j)} = f_{mul}^{(j)}(o; \theta_{mul}^{(j)}), \quad j \in \{1, 2, 3\},    (2)

    s = f_{seg}(o; \theta_{seg}),    (3)

where m^{(j)} and s denote the output of the j-th multi-label block and the output of the segmentation block respectively, and \theta_{mul}^{(j)} and \theta_{seg} are layer parameters. The final prediction is

    p = s + m^{(1)} + m^{(2)} + m^{(3)},    (4)

where p is the fused score for segmentation.

For the loss functions, we use a logistic loss for the predictions of the dense multi-label blocks, m^{(1)}, m^{(2)} and m^{(3)}; a softmax loss is used for the final prediction p. Let m_{ik} be the output of a dense multi-label block at the i-th position for the k-th class, and y_{ik}^{mul} the ground truth for the corresponding position and class. The loss function for dense multi-label is defined as

    l_{mul}(y^{mul}, m) = \frac{1}{IK} \sum_{i=1}^{I} \sum_{k=1}^{K} \Big[ y_{ik}^{mul} \log\Big(\frac{1}{1+e^{-m_{ik}}}\Big) + (1 - y_{ik}^{mul}) \log\Big(\frac{e^{-m_{ik}}}{1+e^{-m_{ik}}}\Big) \Big],    (5)

where y_{ik}^{mul} \in \{0, 1\}, and I and K represent the number of spatial points and classes, respectively.

Similarly, let p_{ik} be the fused output at the i-th position for the k-th class, and y_i^{seg} the ground truth for the segmentation prediction at the i-th position. The loss function for segmentation is defined as

    l_{seg}(y^{seg}, p) = \frac{1}{I} \sum_{i=1}^{I} \sum_{k=1}^{K} \mathbb{1}(y_i^{seg} = k) \log\Big(\frac{e^{p_{ik}}}{\sum_j e^{p_{ij}}}\Big),    (6)

where y_i^{seg} \in \{1, \ldots, K\}. Our goal is to minimize the objective function

    \min \; l_{seg} + \lambda (l_{mul}^{(1)} + l_{mul}^{(2)} + l_{mul}^{(3)}),    (7)

where \lambda controls the balance between the segmentation block and the dense multi-label blocks. We observe that this parameter is not very sensitive, and we set \lambda = 1 to treat each part equally.

3.3. Dense Multi-label Block

The details of the dense multi-label block are shown in Figure 5. The input consists of feature maps at 1/8 resolution, due to the downsampling in the low-level feature layers. After some convolutional layers with further downsampling, the dense multi-label prediction is performed at 1/32 resolution with the sliding window and the following adaptive layers. The reason for this setting is that dense multi-label requires a large sliding window, which would become a computational burden at high resolution. Downsampling greatly reduces the size of the feature maps and, more importantly, shrinks the sliding window accordingly, making the computation more efficient. On the other hand, dense multi-label requires more high-level information, so working at a coarse level captures the high-level features better. The output of the dense multi-label block is upsampled to be compatible with the segmentation block's output.

Figure 5. Details of a single dense multi-label block. The input features (at 1/8 resolution) are fed into several convolutional layers with further downsampling. We then perform a sliding window with a max-pooling operation; after some adaptive layers, we obtain scores for dense multi-label at 1/32 resolution.

3.4. Ground Truth Generation

The ground truth for dense multi-label can be generated from the segmentation ground truth. The process is described in Figure 6. First, the segmentation ground truth is converted to channel-wise labels, which means each channel contains only 1 or 0 to indicate whether the corresponding class appears. To generate a ground-truth mask for each class, for a given window size, we slide the window across each binary channel and perform a max-pool operation (this is equivalent to a binary dilation using a structuring element of the same size and shape as the window). We repeat this process for each window size.

Figure 6. The segmentation ground truth is first converted to channel-wise labels, with 0 or 1 in each channel. The ground truth for dense multi-label is obtained by performing max pooling on the channel-wise labels.

As noted in Section 3.3, the dense multi-label classification is performed at 1/32 resolution while the segmentation is at 1/8.
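To make the dilation equivalence concrete, the per-class max-pooling just described can be sketched as follows. This is our own NumPy illustration under stated assumptions (a single window size in pixels, brute-force loops for clarity), not the paper's implementation:

```python
import numpy as np

def dense_multilabel_ground_truth(seg_gt, num_classes, window):
    """Build dense multi-label ground truth from a segmentation mask.

    seg_gt: (H, W) integer class map; window: odd sliding-window size.
    Returns a (num_classes, H, W) binary array whose channel k is 1 at
    (i, j) iff class k appears anywhere in the window centred there,
    i.e. a binary dilation of each channel-wise label map.
    """
    h, w = seg_gt.shape
    pad = window // 2
    # Channel-wise binary labels: one 0/1 map per class.
    channels = np.stack([(seg_gt == k) for k in range(num_classes)])
    # Zero-pad so border windows are well defined, then take the
    # window-wise maximum (the max-pool / dilation step).
    padded = np.pad(channels, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(channels, dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = padded[:, i:i + window, j:j + window].max(axis=(1, 2))
    return out
```

Repeating this for each window size yields one multi-label target per level, as in Figure 6.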
Therefore, we generate the multi-label ground-truth data at 1/8 resolution with stride 4.

3.5. Network Configuration

The dense multi-label module is suitable for any segmentation system and can be easily integrated. In this study, we use the 50-layer Residual network [23] with dilated kernels [10]. In order to work at a relatively high resolution while keeping efficiency, we use the 8-stride setting, which means that the final output is at 1/8 resolution. As mentioned in the last section, we perform dense multi-label at 1/32 resolution to make it more efficient and effective. The window sizes are therefore defined at 1/32 resolution. For example, let w be the window size: a window with w = 17 at 1/32 resolution means 4w = 68 at 1/8 resolution, and the corresponding window in the original image is 32w = 544. We use w1 = 35, w2 = 17 and w3 = 7 for all the experiments.

Table 1 shows the layer configuration with the 50-layer Residual network (Res50) as the base network. The low-level feature block contains the layers from "conv1" to "res3d". The segmentation block and the dense multi-label blocks contain the layers from "res4a" to "res5c", as well as some adaptive layers. It is worth noting that the segmentation block and the dense multi-label blocks do not share weights even though they are initialized from the same layers; after initialization, they learn their own features separately.

Block name               | Initial layers   | Stride
Low level feature block  | conv1 to res3d   | 8
Segmentation block       | res4a to res5c   | 1
Dense multi-label block  | res4a to res5c   | 4

Table 1. Configuration for the Res50 network. The low-level feature block is initialized with layers "conv1" to "res3d" and has stride 8. The segmentation block and the dense multi-label blocks are initialized with layers "res4a" to "res5c" but do not share weights with each other. The segmentation block has no further downsampling, while the dense multi-label blocks have a further stride of 4.

4. Experiments and Analysis

We evaluate our model on four commonly used semantic segmentation datasets: ADE20k, NYUDv2, SUN-RGBD and PASCAL-Context. Our comprehensive experiments show that dense multi-label can successfully suppress many unlikely labels, retain region consistency and thus improve the performance of semantic segmentation. We use only Res50 as the base network to compare and analyse the performance. For all the experiments, we use a batch size of 8, momentum of 0.9 and weight decay of 0.0005.

4.1. Results on ADE20k dataset

We first evaluate our method on the ADE20k dataset [26], which contains 150 semantic categories, including objects such as person, car, etc., and "stuff" such as sky, road, etc. There are 20210 images in the training set and 2000 images in the validation set.

As shown in Table 2, the model with dense multi-label (DML-Res50) yields a 2% improvement. To analyse the effectiveness of label suppression, we also use two criteria, shown as "Wrong class" and "Wrong label". Wrong class is the number of classes that are not supposed to appear but are mistakenly predicted by the model; wrong label describes how many pixels are assigned to those wrong classes. We observe that dense multi-label effectively reduces the wrong classes and labels, by 35% and 16% respectively. Some examples are shown in Figure 7. To make a fair comparison, all the images are raw outputs directly from the network. The last column shows the outputs of the network with dense multi-label, where we observe much better scene consistency compared with the output of the baseline network shown in the middle.

Figure 7. Example outputs of the Res50 baseline and DML-Res50 on the ADE20k dataset.

In comparison with other methods, we achieve better results than the models reported in [26], as shown in Table 3.
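The "wrong class" and "wrong label" criteria described above admit a simple per-image computation; a sketch of our understanding (hypothetical helper, assuming NumPy; per-image counts would then be averaged over the dataset):

```python
import numpy as np

def wrong_class_and_label(pred, gt):
    """Count label-suppression criteria for one image.

    pred, gt: (H, W) integer class maps. "Wrong classes" are classes
    predicted somewhere in the image but absent from the ground truth;
    "wrong labels" counts the pixels carrying those classes.
    """
    wrong_classes = np.setdiff1d(np.unique(pred), np.unique(gt))
    wrong_labels = int(np.isin(pred, wrong_classes).sum())
    return len(wrong_classes), wrong_labels
```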
The results are evaluated using the intersection-over-union (IoU) score [25]. Moreover, since our original motivation is to suppress noisy and implausible labels so as to keep the labels consistent with the region, we also report, for each image, the number of predicted classes that are not in the ground truth and, further, the number of pixels assigned to these wrong classes.

Model           | IOU   | #Wrong class | #Wrong label
Res50 baseline  | 34.5  | 5.576        | 21836
DML-Res50       | 36.49 | 3.6          | 18294

Table 2. Results on the ADE20k dataset. Dense multi-label boosts performance by 2% IOU and reduces the number of wrong classes and labels by 35% and 16% respectively.

Model                    | IOU
DilatedNet [26]          | 32.31
Cascade-DilatedNet [26]  | 34.90
DML-Res50 (ours)         | 36.49

Table 3. Comparison with other models on the ADE20k dataset. Our model achieves the best performance.

More examples can be found in Figure 8.

Figure 8. More example outputs of the dense multi-label network on the ADE20k dataset.

4.2. Results on PASCAL-Context

The PASCAL-Context dataset [27] is a set of additional annotations for PASCAL VOC 2010, providing annotations for the whole scene with 60 classes (59 classes and a background class). It contains 4998 images in the training set and 5105 images in the validation set.

Model           | IOU   | #Wrong class | #Wrong label
Res50 baseline  | 41.37 | 4.5          | 26308
DML-Res50       | 44.39 | 2.8          | 22367

Table 4. Results on the PASCAL-Context dataset. The dense multi-label model increases the IOU by 3% and reduces the wrong classes and labels by 37% and 15%.

Figure 9 shows some typical examples on this dataset. We again see clear scene consistency with dense multi-label involved. The baseline outputs in the middle contain many noisy classes; in particular, the lower-middle image contains "bird" and "sky", which are very unlikely in this scene. From Table 4, we can also see the boost brought by dense multi-label: the wrong classes and labels are greatly reduced, by 37% and 15%.

Figure 9. Example outputs of the Res50 baseline and DML-Res50 on the PASCAL-Context dataset.
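Stepping back to training: the dense multi-label branches that produce these consistent predictions are trained with the per-position logistic loss of Eq. (5). The sketch below is our own NumPy illustration, written in the conventional negated (binary cross-entropy) form so that it is a quantity to minimize, consistent with the minimization objective of Eq. (7):

```python
import numpy as np

def dense_multilabel_loss(m, y):
    """Logistic loss of Eq. (5) over I positions and K classes.

    m: (I, K) raw scores m_ik; y: (I, K) binary targets y_ik.
    sigma(m) = 1/(1+e^-m) and 1-sigma(m) = e^-m/(1+e^-m) match the
    two log terms of Eq. (5); the leading minus makes it a loss.
    """
    sig = 1.0 / (1.0 + np.exp(-m))
    return float(-np.mean(y * np.log(sig) + (1 - y) * np.log(1 - sig)))
```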
To compare with other models, we list several results on this dataset. Since different models use various settings, such as multi-scale training or extra data, we also note these in Table 5. Considering all the factors involved, our method is competitive, since we only use Res50 as the base network and use neither multi-scale training nor extra MS-COCO data for pretraining. More examples are shown in Figure 10.

Model            | Base   | MS  | Ex data | IOU
FCN-8s [18]      | VGG16  | no  | no      | 37.8
ParseNet [19]    | VGG16  | no  | no      | 40.4
HO CRF [28]      | VGG16  | no  | no      | 41.3
Context [29]     | VGG16  | yes | no      | 43.3
VeryDeep [30]    | Res101 | no  | no      | 44.5
DeepLab [12]     | Res101 | yes | COCO    | 45.7
DML-Res50 (ours) | Res50  | no  | no      | 44.39

Table 5. Results on the PASCAL-Context dataset. MS means using multi-scale inputs and fusing the results in training; Ex data stands for using extra data such as MS-COCO [31]. Compared with the state of the art, since we only use Res50 instead of Res101 and use neither multi-scale training nor extra data, our result is competitive.

Figure 10. More example outputs of the dense multi-label network on the PASCAL-Context dataset.

4.3. Results on NYUDv2

NYUDv2 [32] comprises 1449 images from a variety of indoor scenes. We use the standard split of 795 training images and 654 testing images.

Table 6 shows the results on this dataset. With dense multi-label, the performance improves by more than 1%, and the numbers of wrong classes and labels decrease by about 40% and 16%. Some examples are shown in Figure 11. Scene consistency again plays an important role in removing noisy labels. Compared with some other models, we achieve the best result, as shown in Table 7.

Model           | IOU   | #Wrong class | #Wrong label
Res50 baseline  | 38.8  | 8.2          | 27577
DML-Res50       | 40.23 | 4.9          | 23057

Table 6. Results on the NYUDv2 dataset. The dense multi-label network has 1.4% higher IOU, with 40% and 16% fewer wrong classes and labels respectively.

Model            | IOU
FCN-32s [18]     | 29.2
FCN-HHA [18]     | 34.0
Context [29]     | 40.0
DML-Res50 (ours) | 40.23

Table 7. Comparison with other models on the NYUDv2 dataset. Our method achieves the best result.

Figure 11. Example outputs of the Res50 baseline and DML-Res50 on the NYUDv2 dataset.

4.4. Results on SUN-RGBD

SUN-RGBD [33] is an extension of NYUDv2 [32], which contains 5285 training images and 5050 validation images, and provides pixel labelling masks for 37 classes.

Figure 12 shows some output comparisons on this dataset, where we can easily observe the effect of dense multi-label. The results are shown in Table 8: the network with dense multi-label improves the IOU by more than 3%, and the wrong classes and wrong labels decrease by 36% and 18% respectively. Compared with other methods, the network with dense multi-label reaches the best result, as shown in Table 9. More examples can be found in Figure 13.

Model           | IOU   | #Wrong class | #Wrong label
Res50 baseline  | 39.28 | 5.3          | 24602
DML-Res50       | 42.34 | 3.36         | 20104

Table 8. Results on the SUN-RGBD dataset. Dense multi-label helps increase the performance by more than 3% IOU and decreases the wrong classes and labels by 36% and 18%.

Model              | IOU
Kendall et al. [34] | 30.7
Context [29]        | 42.3
DML-Res50 (ours)    | 42.34

Table 9. Comparison with other models on the SUN-RGBD dataset. We achieve the best result with the dense multi-label network.

Figure 12. Example outputs of the Res50 baseline and DML-Res50 on the SUN-RGBD dataset.

4.5. Ablation Study on PASCAL-Context

Table 10 shows an ablation study on PASCAL-Context. The Res50 baseline yields a mean IOU of 41.37%. Treating this as a baseline, we introduce the dense multi-label module.
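The one-, two- and three-level variants ablated here differ only in how many m^{(j)} terms enter the fused score p of Eq. (4). A minimal sketch of that fusion (our own NumPy illustration, with nearest-neighbour repetition standing in for the network's learned upsampling from 1/32 to 1/8 resolution):

```python
import numpy as np

def fuse_scores(seg_scores, mul_scores_list):
    """Fuse segmentation scores with dense multi-label scores (Eq. 4).

    seg_scores: (K, H, W) output s of the segmentation block (1/8 res).
    mul_scores_list: up to three (K, H/4, W/4) outputs m(j) of the
    multi-label blocks (1/32 res), upsampled x4 here by repetition.
    Returns p = s + m(1) + m(2) + m(3).
    """
    p = seg_scores.copy()
    for m in mul_scores_list:
        # Nearest-neighbour x4 upsampling to match the 1/8 grid.
        p += np.repeat(np.repeat(m, 4, axis=1), 4, axis=2)
    return p
```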
First, in the one-level setting, we use the largest window size, which is essentially global multi-label classification. According to the results, this first level gives the biggest boost. With two levels involved, the global and mid-level windows, the performance improves further. The final level, the smallest window, brings 0.6% more improvement. The dense multi-label module improves the performance by 2.2% in total. After using a CRF as post-processing, we achieve an IOU of 44.39 without using the extra MS-COCO dataset.

Model                    | IOU
Res50 baseline           | 41.37
DML-Res50 1 level        | 42.52
DML-Res50 2 levels       | 42.95
DML-Res50 3 levels       | 43.59
DML-Res50 3 levels + CRF | 44.39

Table 10. Ablation study on PASCAL-Context.

Figure 13. Good examples on the SUN-RGBD dataset.

4.6. Failure Analysis

We also observed some failure cases in the outputs, with the two main types shown in Figure 14. The left half of Figure 14 depicts a failure mode in which objects are misclassified entirely as another class: the assigned labels are consistent, thanks to the dense multi-label module, but the object/region class is wrong. The other failure type is shown in the right half of the figure, where the labels are consistent but the model fails to detect some objects or detects some non-existent objects. In the former case, the error appears primarily to be one exacerbated by the dense multi-label prediction. This could be mitigated by improving the quality of the dense multi-label prediction and/or adjusting the balance between the dense multi-label module and the segmentation part. We emphasize, however, that dense multi-label can technically be integrated into any segmentation system to help retain consistency, and our results show the efficacy of doing so.

Figure 14. Examples of failure cases.

5. Conclusion

In this study, we propose a dense multi-label module to address the problem of scene consistency. With comprehensive experiments, we have shown that dense multi-label can enforce scene consistency in a simple and effective way. More importantly, dense multi-label is a module that can be easily integrated into other semantic segmentation systems.

In terms of future work, we will investigate better ways to combine the dense multi-label module and the segmentation system; in other words, we plan to research better methods for fusing the predictions from dense multi-label and segmentation.

References

[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 580-587, 2014.
[2] João Carreira, Rui Caseiro, Jorge Batista, and Cristian Sminchisescu. Semantic Segmentation with Second-Order Pooling. In Proc. Eur. Conf. Comp. Vis., pages 430-443, 2012.
[3] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous Detection and Segmentation. In Proc. Eur. Conf. Comp. Vis., pages 297-312, 2014.
[4] Payman Yadollahpour, Dhruv Batra, and Gregory Shakhnarovich. Discriminative re-ranking of diverse segmentations. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1923-1930, 2013.
[5] Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning Hierarchical Features for Scene Labeling. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1915-1929, 2013.
[6] Michael Cogswell, Xiao Lin, Senthil Purushwalkam, and Dhruv Batra. Combining the Best of Graphical Models and ConvNets for Semantic Segmentation. arXiv, 2014.
[7] Jifeng Dai, Kaiming He, and Jian Sun. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. 2015.
[8] Seunghoon Hong, Hyeonwoo Noh, and Bohyung Han. Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation. In Proc. Advances in Neural Inf. Process. Syst., pages 1495-1503, 2015.
[9] Falong Shen and Gang Zeng. Fast Semantic Image Segmentation with High Order Context and Guided Filtering. 2016.
[10] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proc. Int. Conf. Learn. Representations, 2015.
[11] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian Reid. Efficient piecewise training of deep structured models for semantic segmentation. 2016.
[12] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv: Comp. Res. Repository, 2016.
[13] Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. CNN: Single-label to Multi-label. arXiv: Comp. Res. Repository, abs/1406.5, 2014.
[14] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. CNN-RNN: A unified framework for multi-label image classification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[15] Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. HCP: A Flexible CNN Framework for Multi-Label Image Classification. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):1901-1907, 2016.
[16] Yuhong Guo and Suicheng Gu. Multi-label classification using conditional dependency networks. In Proc. Int. Joint Conf. Artificial Intell., pages 1300-1305, 2011.
[17] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. Deep Convolutional Ranking for Multilabel Image Annotation. arXiv: Comp. Res. Repository, 2013.
[18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3431-3440, 2015.
[19] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking Wider to See Better. arXiv:1506.04579, 2015.
[20] Xiangyang Xue, Wei Zhang, Jie Zhang, Bin Wu, Jianping Fan, and Yao Lu. Correlative multi-label multi-instance image annotation. In Proc. IEEE Int. Conf. Comp. Vis., pages 651-658, 2011.
[21] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. Int. Conf. Learn. Representations, 2015.
[22] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. 2016.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Inf. Process. Syst., pages 1106-1114, 2012.
[25] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[26] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic Understanding of Scenes through the ADE20K Dataset. arXiv, 2016.
[27] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Raquel Urtasun, and Alan Yuille. The Role of Context for Object Detection and Semantic Segmentation in the Wild. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2014.
[28] Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, and Philip Torr. Higher Order Conditional Random Fields in Deep Neural Networks. arXiv, 2015.
[29] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian Reid. Exploring context with deep structured models for semantic segmentation. arXiv, 2016.
[30] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Bridging Category-level and Instance-level Semantic Image Segmentation. 2016.
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014.
[32] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proc. Eur. Conf. Comp. Vis., pages 746-760, 2012.
[33] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 567-576, 2015.
[34] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv:1511.02680, 2015.