arXiv:1501.07492v2 [cs.CV] 8 Sep 2015

Weakly Supervised Learning for Salient Object Detection

Huaizu Jiang*, et al.

Abstract

Recent advances in supervised salient object detection have resulted in significant performance on benchmark datasets. Training such models, however, requires expensive pixel-wise annotations of salient objects. Moreover, many existing salient object detection models assume that at least one salient object exists in the input image. Such an assumption often leads to less appealing saliency maps on background images, which contain no salient object at all. To avoid the requirement of expensive pixel-wise salient region annotations, in this paper we study weakly supervised learning approaches for salient object detection. Given a set of background images and salient object images, we propose a solution toward jointly addressing the salient object existence and detection tasks. We adopt the latent SVM framework and formulate the two problems together in a single integrated objective function: saliency labels of superpixels are modeled as hidden variables and involved in a classification term conditioned on the salient object existence variable, which in turn depends on both global image and regional saliency features and the saliency label assignment. Experimental results on benchmark datasets validate the effectiveness of our proposed approach.

Figure 1. Saliency maps produced by three state-of-the-art models on background images. From left to right: (a) input, (b) SF [32], (c) DRFI [20], (d) RBD [48].

1. Introduction

Attention: there was a mistake concerning the one-class SVM for salient object detection in the previous version, since the kernel trick was used for the training phase only.

Salient object detection, deriving from classical human fixation prediction [16], aims to separate the entire salient object(s) that attract most of humans' attention in the scene from the background [25]. Driven by applications of saliency detection in computer vision, such as content-aware image resizing [2] and photo collection visualization [39], many computational models have been proposed in the past decade.

There are two main motivations behind this paper. On one hand, recent advances in supervised salient object detection have resulted in significant performance on benchmark datasets [20]. Yet it is time-consuming and tedious to annotate salient objects in order to train a model. On the other hand, most existing salient object detection algorithms assume that at least one salient object exists in the input image (see [3]). However, as shown in Fig. 1, there exist background images [40] that contain no salient objects at all. Based on this impractical assumption, all three state-of-the-art approaches [32, 20, 48] produce inferior saliency maps on background images. To this end, we study how to utilize weakly labeled data to train salient object detection models. Given a set of background images and salient object images, where we only have annotations of salient object existence labels, our goal is to train a salient object detection model.

In this paper, we propose a weakly supervised learning approach to jointly deal with the salient object existence and detection problems. The input image is first segmented into a set of superpixels¹. Saliency labels of superpixels (i.e., foreground or background) are then modeled as hidden variables in the latent structural SVM framework, where the inference can be efficiently solved using the graph cut algorithm [7]. The training problem is built upon the large-margin learning framework to separate the salient object images and the background images.

*Main part of this work was done when Huaizu Jiang was at Xi'an Jiaotong University.
¹In this paper, we use the terms superpixel and region interchangeably.
Our proposed weakly supervised approach is based on a set of unsupervised methods [9, 32, 45, 48]. Compared with supervised approaches, we do not require strong pixel-wise salient object annotations. Furthermore, our approach is capable of recognizing the existence of salient objects.

Our main contributions therefore are twofold: (i) we propose a weakly supervised learning approach based on the latent structural SVM framework, instead of relying on expensive salient object annotations; (ii) compared with conventional approaches, our proposed approach is capable of jointly addressing the salient object existence and detection problems. Our approach performs better than most unsupervised salient object detection models and is comparable with the best supervised approach.

2. Related Work

In this section, we briefly introduce related works in two areas: salient object detection and weakly supervised learning for vision tasks.

Salient object detection. We refer readers to [4, 6] for a comprehensive review of salient object detection models. Here, we briefly introduce some of the most related works.

Visual saliency is usually related to the uniqueness, distinctiveness, and disparity of the scene. Consequently, most existing works focus on designing models to capture the uniqueness of the scene in an unsupervised setting. The uniqueness can be computed for each pixel in the frequency domain [1], by comparing a patch to its most similar ones [15], or by comparing a patch to the average patch of the input image in the principal components space [28]. Benefiting from image segmentation algorithms, more and more approaches try to compute the regional uniqueness in a global manner [9, 32, 5], based on multi-scale [19] and hierarchical segmentations of the image [44].
Moreover, several priors about a salient object have been developed in recent years. Since a salient object is more likely to be placed near the center of the image to attract more attention (i.e., photographer bias), it is natural to assume that the narrow border of the image belongs to the background. Such a background prior is widely studied [42, 45, 18, 24]. It has recently been extended to the background connectivity prior, assuming that a salient object is less likely to be connected to the border area [48, 47]. In addition, the generic objectness prior is also utilized for salient object detection [8, 21, 17]. Other priors include spatial distribution [32, 10] and focusness [21].

There also exist supervised salient object detection models. The Conditional Random Field [25, 27] and the large-margin framework [26] are adopted to learn the fusion weights of saliency features. The integration of saliency features can also be discovered from the training data using a Random Forest [20], Boosted Decision Trees (BDT) [29, 23], or a mixture of Support Vector Machines [22].

Our proposed weakly supervised approach is built upon the feature engineering of several unsupervised approaches [9, 32, 45, 48]. Compared with supervised approaches, however, our approach does not rely on strong saliency annotations; we merely utilize the weak salient object existence labels of training images. Moreover, our proposed latent salient object detection approach (Sec. 3) is capable of jointly addressing the salient object existence and detection problems.

Salient object existence prediction. In [40], the salient object existence prediction problem is studied as a standard binary classification problem based on global saliency features of thumbnail images. Zhang et al. [46] investigate not only existence but also counting the number of salient objects based on holistic cues. In this paper, we focus on recognizing salient object existence. By incorporating latent superpixel saliency labels in our approach, better performance than [40] can be achieved. Moreover, the salient object existence labels are used to train a weakly supervised salient object detection model that predicts superpixels' saliency scores.

Weakly supervised learning. Visual data that are ubiquitously available on the web are by nature weakly labeled, e.g., images on Flickr and videos on YouTube with tags. To leverage these data, weakly supervised learning methods have been extensively studied for vision tasks such as object detection [31, 12], concept learning [36], scene classification [14, 31], semantic image segmentation [38], etc.

In essence, our proposed approach is closely related to the work of visual concept mining from weakly labeled data [35], where the test data are labeled based on strongly annotated negative training data. Compared with [35], our approach is more suitable for salient object detection. In addition, our latent salient object detection based on the latent structural SVM is closely related to the hidden [33] and max-margin [41] conditional random fields.

3. Weakly Supervised Salient Object Detection

In this section, we first present a weakly supervised approach for salient object detection based on the latent structural SVM framework (Sec. 3.1). We then introduce the saliency features used for the salient object existence prediction and detection tasks (Sec. 3.2).

3.1. A Latent Structural SVM Formulation

In this paper, we are interested in learning a model that can not only predict whether there exist salient objects in the input image but also where the salient objects (regions) are, if they exist. Our weakly annotated training data are composed of a set of images and their ground-truth annotations of salient object existence labels (i.e., salient object images vs. background images). Unlike supervised approaches [20, 23], our approach does not need ground-truth annotations of regional saliency labels of the training samples. We call our approach weakly supervised salient object detection since the supervision comes merely from the salient object existence annotations. It is worth exploring weakly supervised learning since it requires far less annotation effort than a supervised setting.

Denote the input image as I, which consists of N superpixels {r_i}_{i=1}^N. The salient object existence label of the image is represented by a binary label y ∈ Y, where Y = {0, 1} denotes whether there exist salient objects (0 for no existence). Regional saliency labels of the image are denoted as h = [h_i]_{i=1}^N, where h_i ∈ H = {0, 1} indicates the saliency label of the superpixel r_i (0 is for background).

Given a set of training samples {(I_m, y_m)}_{m=1}^M, our goal is to learn a model that can be used to predict the salient object existence label y as well as the regional saliency labels h of an unseen test image.
To this end, we learn a discriminative function f_w : I × Y → R over the image I and its salient object existence label y, where w are the parameters. During testing, we can use f_w to predict the class label y* of the input image as y* = argmax_{y∈Y} f_w(I, y).

Due to the lack of annotations, we model the regional saliency labels h as hidden variables in the latent structural SVM framework. We assume f_w(I, y) takes the form f_w(I, y) = max_h ⟨w, Ψ(I, y, h)⟩, where Ψ(I, y, h) is a feature vector depending on the input image I, its salient object existence label y, and the regional saliency labels h.

We consider the global features Φ^e(I) of the input image I to capture the salient object existence in a holistic manner as in [40, 46]. Additionally, each superpixel r_i is represented by two feature vectors Φ_i^f(I) and Φ_i^b(I), modeling its negative log-likelihood of belonging to the foreground and background, respectively. Their detailed definitions are introduced in Sec. 3.2. To account for the spatial constraint that two adjacent superpixels tend to share the same saliency label, we construct an undirected graph G = (V, E). The vertex j ∈ V corresponds to the saliency configuration of the superpixel r_j, and (j, k) ∈ E indicates the spatial constraint between superpixels r_j and r_k. Finally, ⟨w, Ψ(I, y, h)⟩ is defined as follows,

  ⟨w, Ψ(I, y, h)⟩ = Σ_{a∈Y} δ(y = a) ⟨w_a^e, Φ^e(I)⟩
    + Σ_{a∈{0,1}} δ(y = a) Σ_{j∈V} δ(h_j = 1) (⟨w_a^s, Φ_j^f(I)⟩ + w_a^f)
    + Σ_{a∈{0,1}} δ(y = a) Σ_{j∈V} δ(h_j = 0) (⟨w_a^s, Φ_j^b(I)⟩ + w_a^b)
    − Σ_{(j,k)∈E} δ(h_j ≠ h_k) w^p · v_{jk}.   (1)

The model parameters w are the concatenation of the parameters of all the factors in the above equation, i.e., w = [w_a^e, w_a^s, w_a^f, w_a^b, w^p]_{a∈Y}, where w_a^f and w_a^b are two prior terms for each region to be foreground and background, respectively.

In the above formulation, both the salient object existence prediction and detection problems are modeled together in a single integrated objective function. The salient object existence label does not only depend on the global image features Φ^e(I) in a standard classification term, but also on the regional saliency labels h and features Φ_j^f(I) and Φ_j^b(I). Although we are at the same supervision level as existing supervised models for predicting salient object existence labels [40, 46], regional saliency labels are taken into consideration as latent variables in our approach.

In turn, the regional saliency labels h depend on the salient object existence label y as well. We learn two groups of model parameters for salient object detection on salient object images and background images, respectively. Moreover, we learn two prior terms w_a^f and w_a^b modeling the influence of the salient object existence label y on the latent salient object detection h. The last smoothness term encourages adjacent regions to take the same saliency label. v_{jk} captures the similarity of two neighboring regions r_j and r_k. It is defined as v_{jk} = exp(−|c_j − c_k|² / (2σ_c²)), where c_j is the average color vector of the superpixel r_j and the parameter σ_c is set manually.
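For concreteness, the following minimal Python sketch evaluates ⟨w, Ψ(I, y, h)⟩ of Eq. 1 for one fixed labeling. It is an illustration rather than our released implementation; the dictionary layout of w and all array shapes are assumptions made for the example.

import numpy as np

def pairwise_similarity(colors, edges, sigma_c=0.5):
    # v_jk = exp(-|c_j - c_k|^2 / (2 sigma_c^2)) for each edge (j, k)
    diff = colors[edges[:, 0]] - colors[edges[:, 1]]
    return np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma_c ** 2))

def score(w, phi_e, phi_f, phi_b, h, y, edges, v):
    """Evaluate <w, Psi(I, y, h)> of Eq. 1 for one labeling.

    w          -- dict of per-class weights: w['e'][y], w['s'][y],
                  scalar priors w['f'][y], w['b'][y], and w['p'].
    phi_e      -- global existence features, shape (d_e,).
    phi_f/phi_b -- per-region fg/bg features, shape (N, d_s).
    h          -- binary saliency labels, shape (N,).
    edges, v   -- adjacency list (E, 2) and similarities v_jk, shape (E,).
    """
    s = np.dot(w['e'][y], phi_e)                       # classification term
    fg, bg = (h == 1), (h == 0)
    s += np.sum(phi_f[fg] @ w['s'][y]) + fg.sum() * w['f'][y]
    s += np.sum(phi_b[bg] @ w['s'][y]) + bg.sum() * w['b'][y]
    # smoothness: penalize label disagreement between similar neighbors
    disagree = h[edges[:, 0]] != h[edges[:, 1]]
    s -= w['p'] * np.sum(v[disagree])
    return s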
3.2. Saliency Features

In the past decade, researchers have mainly concentrated on designing various features to describe salient objects. Inspired by [9], more and more research effort is spent at the region level. In this paper, we consider the following five kinds of regional saliency features.

Global contrast. As studied in [9, 32, 5], the more distinct a region is from the others, the more salient it might be. The regional global contrast Φ_i^{gc}(I) is computed by comparing the region r_i to the others, where nearby regions are given larger weights in determining the contrast value.

Spatial distribution. This is also an extensively studied saliency feature [25, 32, 10, 5], indicating that the wider a region spreads over the image, the less salient it is. Following [32], we compute the spatial distribution Φ_i^{sp}(I) from the spatial distances of the region r_i to the others, weighted by their appearance distances.

Backgroundness. Since the salient object is usually placed near the image center to attract more attention, the image borders B are more likely to belong to the background. Following [20], the regional backgroundness Φ_i^{bg}(I) is computed by examining the region r_i with respect to B based on different appearance features.

Manifold ranking. In addition to directly comparing each region to the image border B, a region's saliency score can also be defined based on its relevance to B via graph-based manifold ranking [45]. Following [45], we compute the ranking score for the region r_i w.r.t. each side of the image border B and combine them to get the final manifold ranking score Φ_i^{mr}(I).

Boundary connectivity. It is suggested in [48] that a salient region is less likely to be connected to the pseudo-background B. To this end, the boundary connectivity score Φ_i^{bc}(I) of the region r_i is defined as the ratio between its spanning area and its length along the image border.

Figure 2. Illustration of saliency features computed on the Lab color histogram channel. From left to right: (a) input images, (b) global contrast, (c) spatial distribution, (d) backgroundness, (e) manifold ranking, (f) boundary connectivity, and (g) final saliency maps.

For robustness, we compute saliency features on different appearance channels including average RGB, RGB histogram, average HSV, HSV histogram, average Lab, Lab histogram, and Local Binary Pattern (LBP) histogram. Feature distances are computed as the χ² distance for histograms and as the absolute Euclidean distance for the others. Each dimension of the feature is normalized to the range [0, 1]. Finally, we concatenate these five feature descriptors into Φ_i^s(I) = [Φ_i^{gc}(I), Φ_i^{sp}(I), Φ_i^{bg}(I), Φ_i^{mr}(I), Φ_i^{bc}(I)]. In total, we obtain a 35-dimensional feature vector. See Fig. 2 for examples of different saliency features. We refer readers to the original papers for more technical details.

We also define Φ_i^f(I) = −log(1 − Φ_i^s(I)) and Φ_i^b(I) = −log(Φ_i^s(I)), which can be regarded as the negative log-likelihoods of each region belonging to the foreground and background, respectively. Since Φ_i^s(I) ∈ [0, 1], a region is more likely to be categorized as foreground when its saliency feature values are larger.

Based on the saliency features {Φ_i^s(I)}_{i=1}^N, we adopt the same holistic manner as in [40] to capture the existence of salient objects. We resize the pixel-wise saliency map resulting from each appearance channel of {Φ_i^s(I)}_{i=1}^N to 300×300 and divide it into 5×5 grids, concatenating the average saliency value in each grid to form a global saliency feature vector Φ^{GS}(I). Additionally, we also consider the GIST descriptor [30] Φ^{GIST}(I), computed as a concatenation of the averaged responses of 32 Gabor-like filters over a 4×4 grid. Finally, we get a 1387-dimensional (5×5×35 + 32×4×4) feature vector Φ^e(I) = [Φ^{GS}(I), Φ^{GIST}(I)] to capture salient object existence.
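The feature bookkeeping above can be summarized in a short sketch: stacking the five per-channel descriptors into the 35-dimensional regional vector, deriving the foreground/background log-likelihood features, and grid-pooling the saliency maps into the global existence descriptor. The individual feature functions, the GIST extractor, and the stacking order are placeholders assumed for illustration.

import numpy as np

EPS = 1e-6  # clamp to keep the logarithms finite at 0 and 1

def regional_features(per_channel_feats):
    """Concatenate 5 descriptors x 7 appearance channels -> (N, 35).

    per_channel_feats: list of 7 arrays, each (N, 5) holding [global
    contrast, spatial distribution, backgroundness, manifold ranking,
    boundary connectivity] per region, already normalized to [0, 1].
    """
    phi_s = np.hstack(per_channel_feats)               # (N, 35)
    phi_f = -np.log(np.clip(1.0 - phi_s, EPS, 1.0))    # foreground term
    phi_b = -np.log(np.clip(phi_s, EPS, 1.0))          # background term
    return phi_s, phi_f, phi_b

def global_existence_features(saliency_maps, gist):
    """Pool each of the 35 saliency maps over a 5x5 grid -> 875 dims,
    then append the 512-dim GIST descriptor -> 1387 dims in total.

    saliency_maps: array (35, 300, 300), one resized map per feature map.
    gist: array (512,), e.g. 32 filter responses on a 4x4 grid.
    """
    grids = saliency_maps.reshape(35, 5, 60, 5, 60).mean(axis=(2, 4))
    return np.concatenate([grids.reshape(-1), gist])   # (1387,)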
4. Learning and Inference

In this section, we introduce how to learn our model parameters w from training samples (Sec. 4.1) and how to infer both the salient object existence label y and the regional saliency labels h given a test image (Sec. 4.2).

4.1. Large-Margin Learning

Given a set of training samples {(I_m, y_m)}_{m=1}^M, we find the optimal model parameters by minimizing the following regularized empirical risk [13],

  min_w L(w) = (λ/2) ||w||² + (1/M) Σ_{m=1}^M R_m(w),   (2)

where λ controls the trade-off between the regularization term and the loss term. R_m(w) is a hinge loss function defined as

  R_m(w) = max_{y,h} (⟨w, Ψ(I_m, y, h)⟩ + Δ(y_m, y, h)) − max_h ⟨w, Ψ(I_m, y_m, h)⟩,   (3)

where the loss function Δ(y_m, y, h) is defined as follows,

  Δ(y_m, y, h) = δ(y_m ≠ y) + α(y_m, h).   (4)

The first term is the 0/1 loss widely used for multi-class classification. In addition, we introduce the second term to constrain the latent salient object segmentation. For a background image, its regional saliency labels should be all zeros. For a salient object image, we resort to the pseudo-background prior [20] to treat all the saliency labels of regions in the border area of the image as zeros. To this end, the second loss term can be written as

  α(y_m, h) = (1/Z_0) Σ_{l=1}^N β_l δ(h_l ≠ 0),             if y_m = 0,
  α(y_m, h) = (1/Z_1) Σ_{l=1}^N β_l δ(h_l ≠ 0) δ(r_l ∈ B),  if y_m = 1,

where β_l is the area of the region r_l, and Z_0 and Z_1 are normalization terms ensuring α(y_m, h) ∈ [0, 1].

Eq. 2 can be efficiently minimized using the bundle optimization method [13], which iteratively builds an increasingly accurate piecewise quadratic approximation of the objective function L(w) based on its sub-gradient ∂L(w). We first define

  h*_y = argmax_h (⟨w, Ψ(I_m, y, h)⟩ + Δ(y_m, y, h)),  ∀m, ∀y ∈ Y,
  y*_m = argmax_{y∈Y} (⟨w, Ψ(I_m, y, h*_y)⟩ + Δ(y_m, y, h*_y)).   (5)

The sub-gradient ∂L(w) can then be computed as

  ∂L(w) = λw + Ψ(I_m, y*_m, h*_{y*_m}) − Ψ(I_m, y_m, h*_{y_m}).

Given the sub-gradient ∂L(w), the optimal model parameters can be learned by minimizing Eq. 2 using the method of [13].

4.2. Inference

Given a test image I, we maximize Eq. 1 to jointly predict its salient object existence label y* and regional saliency labels h* as follows,

  (y*, h*) = argmax_{y∈Y, h} ⟨w, Ψ(I, y, h)⟩.   (6)

Since the search space Y of y is small, we can iterate over all its possible values. Given any y ∈ Y, we utilize the max-flow algorithm [7] to optimize Eq. 1 and obtain the optimal regional saliency labels. During training, we have to solve the loss-augmented energy function (Eq. 5). Fortunately, we can incorporate the loss on the regional saliency labels into the unary term of Eq. 1. Therefore, we can again utilize the max-flow algorithm [7] for efficient inference.

To output a saliency map, we diffuse the latent segmentation result of the salient object using the quadratic energy function of [26] as follows,

  z = γ (I + γL)^{−1} h,   (7)

where z = [z_i]_{i=1}^N and z_i ∈ [0, 1] is the saliency value of the superpixel r_i. I is the identity matrix, V = [v_ij] is the affinity matrix, and D = diag{d_11, ..., d_NN} is the degree matrix, where d_ii = Σ_j v_ij. L = D − V is the Laplacian matrix.
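The diffusion step of Eq. 7 is a single linear solve. A minimal numpy sketch follows, assuming the superpixel adjacency and similarities from Sec. 3.1; it is illustrative, not the timed implementation reported in Sec. 5.4.

import numpy as np

def diffuse_saliency(h, edges, v, gamma=1.0):
    """Smooth the binary latent labels h into soft saliency values z
    via the quadratic energy of Eq. 7: z = gamma * (I + gamma L)^-1 h.

    h     -- latent saliency labels from the max-flow inference, (N,).
    edges -- (E, 2) superpixel adjacency; v -- (E,) similarities v_ij.
    """
    n = h.shape[0]
    V = np.zeros((n, n))
    V[edges[:, 0], edges[:, 1]] = v
    V[edges[:, 1], edges[:, 0]] = v                  # symmetric affinity
    L = np.diag(V.sum(axis=1)) - V                   # Laplacian L = D - V
    z = gamma * np.linalg.solve(np.eye(n) + gamma * L, h.astype(float))
    return np.clip(z, 0.0, 1.0)                      # saliency values in [0, 1]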
5. Experimental Results

5.1. Setup

The only background images publicly available in the literature are those of the thumbnail background image dataset [40]. Images in this dataset, however, are of low resolution (130×130). Since we are interested in images of common sizes (e.g., 400×300), this dataset is not suitable for our scenario. We therefore collect 6182 background images from the SUN dataset [43], the describable texture dataset [11], Flickr, and the Bing image search engine. We randomly sample 5000 background images to train our model and leave the other 1182 images for testing. Additionally, we randomly sample 5000 images from the MSRA10K dataset [9] for training and 1237 images for testing. In total, we have 10000 images for training and 2419 for testing.

For the salient object detection task, we evaluate our proposed approach (LSSVM) on the MSRA-B [20] and ECSSD [44] datasets with pixel-wise annotations. MSRA-B contains 5000 images with variations including natural scenes, animals, indoor scenes, etc. There are 1000 semantically salient but structurally complex images in ECSSD, making it very challenging.

We compare our approach with 14 state-of-the-art salient object detection models, including 12 unsupervised methods and 2 supervised models, which are summarized in Tab. 1.

Table 1. Taxonomy of different salient object detection algorithms based on supervision type and the tasks that each method can solve. (Abbreviations unspvd. and spvd. denote unsupervised and supervised, respectively.)

  method     supervision                    task                     pub. & year
  SVO [8]    unspvd.                        detection                ICCV 2011
  CA [15]    unspvd.                        detection                CVPR 2010
  CB [19]    unspvd.                        detection                BMVC 2011
  RC [9]     unspvd.                        detection                PAMI 2015
  SF [32]    unspvd.                        detection                CVPR 2012
  LRK [34]   unspvd.                        detection                CVPR 2012
  HS [44]    unspvd.                        detection                CVPR 2013
  GMR [45]   unspvd.                        detection                CVPR 2013
  PCA [28]   unspvd.                        detection                CVPR 2013
  MC [18]    unspvd.                        detection                ICCV 2013
  DSR [24]   unspvd.                        detection                ICCV 2013
  RBD [48]   unspvd.                        detection                CVPR 2014
  DRFI [20]  spvd.                          detection                CVPR 2013
  HDCT [23]  spvd.                          detection                CVPR 2014
  GS [40]    spvd.                          existence, localization  CVPR 2012
  SOS [46]   spvd.                          existence, counting      CVPR 2015
  LSSVM      weakly spvd. (spvd. + latent)  detection, existence     (ours)

Following the benchmark [6], for quantitative comparisons we binarize a saliency map with a fixed threshold ranging from 0 to 255. At each threshold, we compute Precision and Recall scores, from which we plot a Precision-Recall (PR) curve. To obtain a scalar metric, we report the average precision (AP) score, defined as the area under the PR curve. Additionally, we report the Mean Absolute Error (MAE) scores between saliency maps and the ground-truth binary masks.
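This protocol is straightforward to reproduce. The sketch below computes the PR curve, the AP as the area under it, and the MAE for a single image, assuming both the saliency map and the mask are arrays with values in [0, 1]; the benchmark scores average these quantities over a dataset.

import numpy as np

def pr_curve(smap, gt):
    # binarize the saliency map at thresholds 0..255 and compare
    # against the binary ground-truth mask
    smap = (smap * 255).astype(np.uint8)
    gt = gt.astype(bool)
    precision, recall = [], []
    for t in range(256):
        pred = smap >= t
        tp = np.logical_and(pred, gt).sum()
        precision.append(tp / max(pred.sum(), 1))   # guard empty prediction
        recall.append(tp / max(gt.sum(), 1))        # guard empty mask
    return np.array(precision), np.array(recall)

def average_precision(precision, recall):
    # AP as the area under the PR curve (trapezoidal rule)
    order = np.argsort(recall)
    return np.trapz(precision[order], recall[order])

def mae(smap, gt):
    # mean absolute error between saliency map and binary mask
    return np.abs(smap - gt.astype(float)).mean()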
Figure 3. Empirical analysis of our approach on the test set, MSRA-B, and ECSSD datasets. From top to bottom: (a)(b) accuracy of salient object existence prediction and AP scores of salient object detection versus the number of training images (M in Eq. 2); (c)(d) accuracy of salient object existence prediction and AP scores of salient object detection versus different feature combinations (legend: Full, No Global Contrast, No Spatial Distribution, No Backgroundness, No Manifold Ranking, No Boundary Connectivity).

5.2. Empirical Analysis of Our Approach

Here we empirically analyze our proposed approach on the test set, MSRA-B, and ECSSD datasets. In particular, we quantitatively study the performance of both the salient object detection and salient object existence prediction tasks by varying the following parameters.

Number of Training Images. As can be seen from Fig. 3(a), the latent structural SVM benefits from a larger number of training samples: the classification accuracy almost keeps increasing as more training images are adopted, on all three datasets. However, according to Fig. 3(b), the performance of salient object detection does not always increase when more training samples are available. The reason might be that, in contrast to salient object existence, we have only indirect (weak) supervision during training to constrain the salient object segmentation results.

Feature Importance. To measure the importance of the features, we remove each kind of feature set and observe the performance variations on both tasks. In terms of salient object existence prediction, according to Fig. 3(c), the feature importance on the three datasets is diverse. For instance, backgroundness is recognized as the most important feature on the test set while considered the least critical one on MSRA-B. Regarding the salient object detection task, according to Fig. 3(d), the ranking of feature importance is consistent on MSRA-B and ECSSD. The features, from the most important to the least important, are: boundary connectivity, global contrast, backgroundness, spatial distribution, and manifold ranking. It is worth noting that the full feature vector performs the best.

5.3. Salient Object Existence Prediction

Here we quantitatively study our proposed approach in terms of the salient object existence prediction task. We compare our approach with three kinds of baselines, where we train a linear SVM, two non-linear SVMs (using the χ² and rbf kernels, respectively), and a Random Forest using our global image features Φ^e(I).

Table 2. Classification accuracy of different approaches on benchmark datasets. (Updated.)

  method       Test Set   MSRA-B   ECSSD
  linear SVM   90.20      87.84    75.20
  χ² SVM       93.14      90.80    81.80
  rbf SVM      95.37      93.16    82.90
  RF           92.24      91.52    84.50
  [40]         90.64      89.26    72.50
  LSSVM        93.96      90.82    76.90
  χ² LSSVM     95.58      92.54    79.90

As we can see in Tab. 2, by considering latent variables, our proposed approach (LSSVM) achieves higher accuracy than the linear SVM. However, since the rbf SVM, the χ² SVM, and the Random Forest are non-linear classifiers, they perform better than our approach. This suggests that our approach may further benefit from non-linearly transforming our global features (via a kernel function). Therefore, we train a non-linear version of LSSVM (denoted as χ² LSSVM), where we use the explicit feature mapping of [37] to transform Φ^e(I) so as to approximate the χ² kernel. As can be seen, benefiting from latent variables, its classification accuracy is still higher than that of its baseline (χ² SVM) on both the MSRA-B and ECSSD datasets. Moreover, it achieves the highest classification accuracy on the test set.

Compared with the state-of-the-art approach of [40], our approach has two advantages: more powerful features and the incorporation of latent saliency information. Though a non-linear classifier (Random Forest) is utilized in [40], as we can see from Tab. 2, our approach has higher classification accuracy on all datasets. Moreover, compared with [40], our approach is able to jointly address the salient object existence and detection problems.
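The χ² LSSVM variant described above only changes the global feature vector before training. A minimal sketch of the transformation using scikit-learn's additive χ² approximation follows; the sampler stands in for the explicit feature map of [37], and the downstream latent structural SVM remains a linear model on the mapped features.

import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler

# Phi_e: global existence features, shape (M, 1387), assumed in [0, 1];
# random values here are a stand-in for the real features.
rng = np.random.default_rng(0)
Phi_e = rng.random((10, 1387))

# Explicit feature map approximating the additive chi^2 kernel:
# a linear model on the mapped features mimics a chi^2-kernel SVM.
mapper = AdditiveChi2Sampler(sample_steps=2)
Phi_e_mapped = mapper.fit_transform(Phi_e)

# The mapped vectors replace Phi_e(I) in Eq. 1; learning and
# inference of the latent structural SVM are otherwise untouched.
print(Phi_e.shape, "->", Phi_e_mapped.shape)   # (10, 1387) -> (10, 4161)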
5.4. Salient Object Detection

In this section, we compare our LSSVM approach with other state-of-the-art salient object detection approaches. Our LSSVM approach is designed to address the limitation of conventional approaches, which impractically assume that at least one salient object exists in the input image. To make the comparisons fairer, we introduce a two-stage scheme (see the sketch at the end of this subsection): we first predict the existence label of salient objects using the rbf SVM introduced in Sec. 5.3. If there are no salient objects, we output an all-black saliency map. Otherwise, we generate saliency maps using the different approaches.

In addition to the MSRA-B and ECSSD benchmark datasets, we check the performance of the different approaches on the test set consisting of 1237 salient object images and 1182 background images. Since the ground-truth annotations of background images are all-black images, only MAE scores are feasible to report on the test set. See Tab. 3 and Fig. 5 for quantitative comparisons.

Table 3. AP and MAE scores compared with state-of-the-art approaches on different benchmark datasets, where supervised approaches are marked with bold fonts. The best three scores are highlighted with red, green, and blue fonts, respectively. (Updated.)

                   AP                MAE
  method           MSRA-B   ECSSD   MSRA-B   ECSSD   Test Set
  rbf SVM + SVO    0.631    0.458   0.333    0.388   0.212
  rbf SVM + CA     0.512    0.390   0.241    0.326   0.101
  rbf SVM + CB     0.652    0.483   0.184    0.275   0.111
  rbf SVM + RC     0.672    0.506   0.135    0.233   0.093
  rbf SVM + SF     0.607    0.473   0.168    0.270   0.052
  rbf SVM + LRK    0.680    0.483   0.207    0.295   0.118
  rbf SVM + HS     0.631    0.479   0.153    0.258   0.104
  rbf SVM + GMR    0.709    0.517   0.126    0.235   0.085
  rbf SVM + PCA    0.666    0.468   0.185    0.282   0.080
  rbf SVM + MC     0.701    0.509   0.142    0.247   0.101
  rbf SVM + DSR    0.694    0.524   0.119    0.229   0.076
  rbf SVM + RBD    0.732    0.530   0.113    0.226   0.080
  rbf SVM + DRFI   0.732    0.548   0.129    0.231   0.101
  rbf SVM + HDCT   0.707    0.502   0.148    0.250   0.112
  LSSVM            0.748    0.573   0.129    0.237   0.086
  χ² LSSVM         0.780    0.578   0.123    0.231   0.097

Since an all-black saliency map is generated for any input that is classified as a background image, the precision and recall scores are zero at all thresholds but 0 (the recall score is 1 when the threshold is 0, indicating that all pixels are recognized as salient). This is why the PR curves become flat when the recall approaches 1.

We can see in Fig. 5 that the PR curves of our approach are higher than those of the others in most places. In terms of AP scores, the linear version (LSSVM) outperforms the other unsupervised and supervised approaches on both the MSRA-B and ECSSD datasets. Augmented with the explicit χ² kernel feature mapping, even better performance can be achieved, indicating that the salient object existence and detection problems can be mutually beneficial when modeled in a unified framework. Specifically, χ² LSSVM performs better than the second best method by 6.8% (RBD) on MSRA-B and by 5.5% (DRFI) on ECSSD. While the MAE scores are not as superior as the AP scores, χ² LSSVM is ranked third best on both the MSRA-B and ECSSD datasets. The reason why it performs worse on the test set might be that our approach cannot always produce all-black saliency maps for background images as the other methods do².

²Recall that we produce an all-black saliency map if the rbf SVM recognizes an input as a background image.

Figure 4. Qualitative comparisons of saliency maps produced by different approaches. From left to right: (a) input images, (b)-(g) saliency maps of the state-of-the-art approaches SF [32], GMR [45], DSR [24], RBD [48], HDCT [23], and DRFI [20], (h) saliency maps of our proposed approach LSSVM.

In Fig. 4, we provide qualitative comparisons of our approach and the other top-performing approaches. As can be seen, our LSSVM approach can produce appealing saliency maps on images where salient objects touch the image border, even though we utilize the background prior to extract regional saliency features and to constrain the latent salient object detection. Moreover, on background images, our LSSVM approach generates near all-black saliency maps, clearly indicating the absence of salient objects.
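For reference, the two-stage scheme used for the comparisons in Tab. 3 amounts to a simple guard around any saliency model; the classifier and detector below are placeholder callables assumed for illustration.

import numpy as np

def two_stage_saliency(image, existence_clf, detector):
    """Two-stage scheme of Sec. 5.4: an rbf-SVM existence classifier
    gates an arbitrary salient object detector.

    existence_clf -- callable image -> 1 if salient objects exist, else 0.
    detector      -- callable image -> saliency map in [0, 1].
    """
    if existence_clf(image) == 0:
        # classified as a background image: all-black saliency map
        return np.zeros(image.shape[:2], dtype=float)
    return detector(image)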
Figure 5. Precision-Recall curves of different approaches (SVO, CA, CB, RC, SF, LRK, HS, GMR, PCA, MC, DSR, RBD, DRFI, HDCT, LSSVM, χ² LSSVM) on the MSRA-B and ECSSD benchmark datasets. (Updated.)

On a PC equipped with an Intel i7 CPU (3.4GHz) and 32GB of RAM, it takes about 12h to train our approach using MATLAB code and 0.5h to train the rbf SVM using C++. In testing, it takes around 3s to extract the features. Our approach takes 0.02s for the joint inference of the existence label and the saliency map. In contrast, it takes 0.21s for the rbf SVM to predict the salient object existence (excluding feature extraction), and RBD takes 0.3s to output a saliency map.

5.5. Limitations

Sometimes our approach makes incorrect classifications between salient object images and background images. See Fig. 6 for some failure cases. In the top row, the bird is hiding in the leaves, where the cluttered background and the complex structure of the bird make salient object detection difficult even for a human being at first glance. In the bottom row, the textures of the image produce inferior saliency features, resulting in an incorrect classification.

Figure 6. Failure cases of our LSSVM approach. The top row is a salient object image that is incorrectly recognized as a background image. The bottom row is a background image mis-classified as a salient object image. From left to right: (a) input images, (b)(c) saliency features of boundary connectivity and manifold ranking on the Lab histogram channel, and (d) saliency maps produced by our LSSVM approach.

6. Discussion and Conclusion

In this paper, we propose a weakly supervised learning approach for salient object detection based on the latent structural SVM framework using background images. Without any prior assumption on the existence of salient objects, our approach is capable of jointly dealing with the salient object existence prediction and detection tasks. Experimental results on benchmark datasets validate the effectiveness of our approach.

As a potential application, if we can recognize a background image, we no longer need to resort to complicated content-aware image resizing techniques (e.g., [2]). Instead, a standard bicubic interpolation method may be enough for background images such as those shown in Fig. 4.

For future work, we plan to investigate more advanced global features, such as the CNN features used in [46], to further increase the accuracy of the classification of salient object images and background images.

Since most existing approaches focus on the unsupervised and supervised scenarios, we hope our work will draw the attention of researchers to weak supervision and make them realize the value of background images. We will release our code and background images for further research.

References

[1] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk. Frequency-tuned salient region detection. In CVPR, 2009.
[2] S. Avidan and A. Shamir. Seam carving for content-aware image resizing. ACM TOG, volume 26, page 10, 2007.
[3] A. Borji. What is a salient object? A dataset and a baseline model for salient object detection. IEEE TIP, 2014.
[4] A. Borji, M.-M. Cheng, H. Jiang, and J. Li. Salient object detection: A survey. arXiv preprint arXiv:1411.5878, 2014.
[5] A. Borji and L. Itti. Exploiting local and global patch rarities for saliency detection. In CVPR, pages 478–485, 2012.
[6] A. Borji, D. N. Sihite, and L. Itti. Salient object detection: A benchmark. In ECCV, pages 414–429, 2012.
[7] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE TPAMI, 26(9):1124–1137, 2004.
[8] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai. Fusing generic objectness and visual saliency for salient object detection. In ICCV, pages 914–921, 2011.
[9] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3):569–582, 2015.
[10] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook. Efficient salient region detection with soft image abstraction. In ICCV, pages 1529–1536, 2013.
[11] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
[12] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. IJCV, 100(3):275–293, 2012.
[13] T. M. T. Do and T. Artières. Regularized bundle methods for convex and non-convex risks. JMLR, 13:3539–3583, 2012.
[14] R. Fergus, P. Perona, and A. Zisserman. Weakly supervised scale-invariant learning of models for visual recognition. IJCV, 71(3):273–303, 2007.
[15] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE TPAMI, 34(10), 2012.
[16] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 1998.
[17] Y. Jia and M. Han. Category-independent object-level saliency detection. In ICCV, 2013.
[18] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang. Saliency detection via absorbing Markov chain. In ICCV, 2013.
[19] H. Jiang, J. Wang, Z. Yuan, T. Liu, and N. Zheng. Automatic salient object segmentation based on context and shape prior. In BMVC, 2011.
[20] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li. Salient object detection: A discriminative regional feature integration approach. In CVPR, pages 2083–2090, 2013.
[21] P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by UFO: Uniqueness, focusness and objectness. In ICCV, 2013.
[22] P. Khuwuthyakorn, A. Robles-Kelly, and J. Zhou. Object of interest detection by saliency learning. In ECCV, 2010.
[23] J. Kim, D. Han, Y.-W. Tai, and J. Kim. Salient region detection via high-dimensional color transform. In CVPR, 2014.
[24] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang. Saliency detection via dense and sparse reconstruction. In ICCV, 2013.
[25] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect a salient object. IEEE TPAMI, 33(2):353–367, 2011.
[26] S. Lu, V. Mahadevan, and N. Vasconcelos. Learning optimal seeds for diffusion-based salient object detection. In CVPR, 2014.
[27] L. Mai, Y. Niu, and F. Liu. Saliency aggregation: A data-driven approach. In CVPR, pages 1131–1138, 2013.
[28] R. Margolin, A. Tal, and L. Zelnik-Manor. What makes a patch distinct? In CVPR, pages 1139–1146, 2013.
[29] P. Mehrani and O. Veksler. Saliency segmentation based on learning and graph cut refinement. In BMVC, pages 1–12, 2010.
[30] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[31] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, pages 1307–1314, 2011.
[32] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung. Saliency filters: Contrast based filtering for salient region detection. In CVPR, pages 733–740, 2012.
[33] A. Quattoni, S. B. Wang, L. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE TPAMI, 29(10):1848–1852, 2007.
[34] X. Shen and Y. Wu. A unified approach to salient object detection via low rank matrix recovery. In CVPR, 2012.
[35] P. Siva, C. Russell, and T. Xiang. In defence of negative mining for annotating weakly labelled data. In ECCV, pages 594–608, 2012.
[36] K. D. Tang, R. Sukthankar, J. Yagnik, and F. Li. Discriminative segment annotation in weakly labeled video. In CVPR, pages 2483–2490, 2013.
[37] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE TPAMI, 34(3):480–492, 2012.
[38] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised structured output learning for semantic segmentation. In CVPR, pages 845–852, 2012.
[39] J. Wang, L. Quan, J. Sun, X. Tang, and H.-Y. Shum. Picture collage. In CVPR, volume 1, pages 347–354, 2006.
[40] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li. Salient object detection for searched web images via global saliency. In CVPR, pages 3194–3201, 2012.
[41] Y. Wang and G. Mori. Hidden part models for human action recognition: Probabilistic versus max margin. IEEE TPAMI, 33(7):1310–1323, 2011.
[42] Y. Wei, F. Wen, W. Zhu, and J. Sun. Geodesic saliency using background priors. In ECCV, volume 7574, pages 29–42, 2012.
[43] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492, 2010.
[44] Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In CVPR, pages 1155–1162, 2013.
[45] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In CVPR, 2013.
[46] J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Mech. Salient object subitizing. In CVPR, 2015.
[47] J. Zhang and S. Sclaroff. Saliency detection: A Boolean map approach. In ICCV, pages 153–160, 2013.
[48] W. Zhu, S. Liang, Y. Wei, and J. Sun. Saliency optimization from robust background detection. In CVPR, 2014.