Co-segmentation for Space-Time Co-located Collections

Hadar Averbuch-Elor¹  Johannes Kopf²  Tamir Hazan³  Daniel Cohen-Or¹
¹Tel Aviv University  ²Facebook  ³Technion – Israel Institute of Technology

Abstract

We present a co-segmentation technique for space-time co-located image collections. These prevalent collections capture various dynamic events, usually by multiple photographers, and may contain multiple co-occurring objects which are not necessarily part of the intended foreground object, resulting in ambiguities for traditional co-segmentation techniques. Thus, to disambiguate what the common foreground object is, we introduce a weakly-supervised technique, where we assume only a small seed, given in the form of a single segmented image. We take a distributed approach, where local belief models are propagated and reinforced with similar images. Our technique progressively expands the foreground and background belief models across the entire collection. The technique exploits the power of the entire set of images without building a global model, and thus successfully overcomes large variability in the appearance of the common foreground object. We demonstrate that our method outperforms previous co-segmentation techniques on challenging space-time co-located collections, including dense benchmark datasets which were adapted for our novel problem setting.

Figure 1: The appearance of the Duchess of Cambridge varies throughout the images that capture her wedding ceremony. Starting from a single image template (marked with a red border), our method progressively expands the foreground belief model across the entire collection.

1 Introduction

Nowadays, Crowdcam photography is both abundant and prevalent [2, 1]. A crowd of people capturing various events forms collections with great variety in content. However, they normally share a common theme. We refer to a collection of images that was captured about the same time and space as "space-time co-located" images, and we assume that such a co-located collection contains a significant subset of images that share a common foreground object, but other objects may also co-occur throughout the collection. See Figure 1 for such an example, where the Duchess of Cambridge is photographed at her wedding, and some of the images contain, for instance, her husband, the Duke of Cambridge.

Foreground extraction is one of the most fundamental problems in computer vision, receiving ongoing attention for several decades now. Technically, the problem of cutting out the common foreground object from a collection of images is known and has been referred to as co-segmentation [7, 21, 6]. A traditional co-segmentation problem assumes that objects which are both co-occurring and salient necessarily belong to the foreground regions. However, the space-time co-location of the images leads to a more challenging setting, where the premise of common co-segmentation techniques is no longer valid, as the foreground object is not well-defined. Therefore, we ask the user to provide a segmented template image to specify what the intended foreground object is.

The foreground object varies considerably in appearance across the entire space-time co-located collection. Thus, we do not use a single global model to represent it, but instead take a distributed local approach. We decompose each image into parts at multiple scales. Parts store local beliefs about the foreground and background models. These beliefs are iteratively propagated to similar parts within and among images. In each iteration, one image is selected as the current seed. See Figure 2, which illustrates the progression of beliefs in the network of images.

The propagation of beliefs from a given seed is formulated as a convex belief propagation (CBP) optimization. Foreground and background likelihood maps of neighboring images are first inferred independently (see Section 4.2).
These beliefs are then reinforced across images to consolidate local models and thus allow for more refined likelihood estimates (see Section 4.3). To allow for a joint image inference, we extend the CBP algorithm to include quadratic terms.

Figure 2: Our technique iteratively propagates beliefs to images (framed in blue) which are adjacent to the current seed (framed in red). In each iteration, object likelihood maps are first inferred from the seed image to each one of its adjacent images (illustrated with red edges), and then these maps are propagated across similar images (illustrated with blue edges) to reinforce the inference.

We show that when starting from a reliable seed model, we can progressively expand the foreground belief model across the entire collection. This gradual progression succeeds in co-segmenting the collection, outperforming state-of-the-art co-segmentation techniques on rich benchmark datasets which were adapted for our problem setting. We also provide an extensive evaluation on various space-time co-located collections which contain repeated elements that do not necessarily belong to the semantic foreground region. Our analysis demonstrates the advantages of our technique over previous methods, and in particular illustrates its robustness against significantly cluttered backgrounds.

The main contributions of our work are (i) the introduction of the novel co-segmentation problem for space-time co-located image collections, (ii) a distributed approach that can handle the great variability in appearance of the foreground object, and (iii) an extended variational scheme for propagating information within and across images.

2 Related work

Segmenting and extracting the foreground object from an image is a fundamental and challenging problem, which has received significant ongoing attention. Extracting the foreground object requires some guidance or supervision, since in most cases it is unclear what the semantic intent is. When several images that share a common foreground are given, the problem is referred to as co-segmentation [20]. Many solutions have been proposed to the co-segmentation problem, which can be applied to image collections of varying sizes and characteristics [6, 5, 7, 21, 14]. Co-segmentation techniques learn the appearance commonalities in the collection to infer the common foreground object or objects. To initialize the learning process, unsupervised techniques usually rely on objectness [23] or visual saliency [6, 22] cues to estimate the target object.

State-of-the-art co-segmentation methods are based on recent advancements in feature matching and correspondence techniques [22, 21, 7]. Rubinstein et al. [21] proposed to combine saliency and dense correspondences to co-segment large and noisy internet collections. Faktor and Irani [7] also use dense correspondences; however, they compute the statistical significance of the shared regions, rather than computing saliency separately per image. These techniques are unsupervised, and they assume that recurrent and unique segments necessarily belong to the object of interest. However, in many collections this is not the case, and some minimal semantic supervision is then required. Batra et al. [3], for example, aimed at topically related images, and their supervision was given in the form of multiple user scribbles. In our work, we deal with images that belong to the same instance, and not to a general class, which exhibit great variability in appearance. We use the segmentation of a single image in the collection to guide the process and target the intended object.

The work of Kim and Xing [12] is most closely related to ours. In their work they address the problem of multiple foreground co-segmentation, where K objects of interest repeatedly occur over an entire image collection. They show promising results when roughly 20% of their images are manually annotated. In our work, we target a single object of interest, while space-time co-located collections often contain several repeated elements that clutter and distract common means of distinguishing the foreground. Unlike their global optimization framework, which solves for all the segmentations at once, our technique progresses gradually, and each image in turn guides the segmentation of its adjacent images. In this sense of progression, the method of Kuettel et al. [15] is similar to ours. However, their method is strongly based on the semantic hierarchy of ImageNet, while we aim at segmenting an unstructured space-time co-located image collection.
There are other space-time co-located settings where images share a common foreground. One setting is a video sequence [19, 8], where the coherence among the frames is very high. It is worth noting that even then, the extraction of the foreground object is surprisingly difficult. Kim and Xing [13] presented an approach to co-segment multiple photo streams, which are captured by various users at varying places and times. Similarly to our work, they also iteratively use a belief propagation model over the image graph. Another setting is multi-view object segmentation, e.g., [9, 4], where the common object is captured from a calibrated set of cameras. These techniques commonly employ 3D reconstruction of the scene to co-segment the object of interest. In our setting, the images are rather sparse in scene-space and not necessarily captured all at once, which makes any attempt to reconstruct the target object highly improbable.

3 Co-segmentation using iterative propagation

We describe the foreground and background models, denoted by F and B, using local beliefs that are propagated within and across images. To define a topology over which we propagate beliefs, we construct a parts-level graph Gp, where nodes are image parts from all images, and edges connect corresponding parts in different images or spatially neighboring parts within an image. Furthermore, we define an associated image-level graph Gi, where the nodes correspond to the images, and two images are connected by an edge if there exists at least one edge in Gp that connects the pair of images. The F/B likelihoods are iteratively propagated throughout the parts-level graph Gp, while the propagation flow is determined according to the image-level graph Gi. The graph topology is illustrated in Figure 3.

Figure 3: The image-level graph Gi (on top) defines a topology of images over which local belief models are iteratively propagated. In each iteration, a seed image (marked with an orange border) propagates the F/B likelihood maps to its adjacent images (marked with a purple border). From these likelihood estimates, we extract the common foreground object (in red) and choose the next seed image.

In what follows, we first explicitly define the graph topology. We then describe how these beliefs are gradually spread across the entire image collection, starting from the user-segmented template image.

3.1 Propagation graph topology

The basis for the propagation are image parts. To obtain the parts, we use the hierarchical image segmentation method of Arbeláez et al. [18]. We threshold the ultrametric contour map, which defines the hierarchy of image regions, at a relatively fine level (λ = 0.15). See Figure 6 (on the left) for an illustration of the parts obtained at a number of different levels. The level we use for the image parts is illustrated in the left-most image. Although a fine level yields a large number of perhaps less meaningful parts, it should be noted that a coarser level often merges foreground and background parts.

We construct a parts-level graph Gp, where edges connect corresponding parts or spatially neighboring parts within an image. To compute reliable correspondences between image parts, we use the non-rigid dense correspondence technique (NRDC) [10], which outputs a confidence measure (with values between 0 and 1) along with each displacement value. We consider corresponding pixels to be those with a confidence which exceeds a certain threshold, which we set empirically to 0.5.

Two images are connected by an edge in the associated image-level graph Gi if there exists at least one edge in Gp that connects the pair of images.
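To make this construction concrete, the following Python sketch builds Gp and Gi from precomputed inputs. It is a minimal illustration under stated assumptions, not our implementation: the per-image part label maps and the match-array format are hypothetical stand-ins for the UCM parts and NRDC output, and only the 0.5 confidence threshold and the Gp-to-Gi lifting rule come from the text above.

    import numpy as np

    # Illustrative data formats: `label_map` assigns each pixel the id of
    # its fine-level part; `matches` stands in for NRDC correspondences.

    CONF_THRESH = 0.5  # NRDC confidence threshold used in the text


    def dilate(mask):
        """One-pixel 4-neighborhood dilation of a boolean mask."""
        out = mask.copy()
        out[1:, :] |= mask[:-1, :]
        out[:-1, :] |= mask[1:, :]
        out[:, 1:] |= mask[:, :-1]
        out[:, :-1] |= mask[:, 1:]
        return out


    def intra_edges(image_id, label_map):
        """Gp edges between spatially neighboring parts of one image."""
        edges = set()
        for part in np.unique(label_map):
            touching = np.unique(label_map[dilate(label_map == part)])
            for other in touching[touching > part]:
                edges.add(((image_id, int(part)), (image_id, int(other))))
        return edges


    def cross_edges(id_u, labels_u, id_v, labels_v, matches):
        """Gp edges between parts sharing a confident correspondence.

        `matches` is an (N, 5) array of [yu, xu, yv, xv, confidence] rows.
        """
        edges = set()
        for yu, xu, yv, xv, conf in matches:
            if conf >= CONF_THRESH:
                pu = int(labels_u[int(yu), int(xu)])
                pv = int(labels_v[int(yv), int(xv)])
                edges.add(((id_u, pu), (id_v, pv)))
        return edges


    def image_graph(gp_edges):
        """Lift Gp to Gi: two images are adjacent if at least one Gp edge
        connects their parts."""
        return {tuple(sorted((a, b)))
                for (a, _), (b, _) in gp_edges if a != b}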
3.2 Iterative likelihood propagation

We assign each part in Gp a foreground likelihood. Initially all parts are equally likely to be foreground or background (except the parts in the user-segmented template image, whose F-likelihood is either exactly 0 or 1).

The likelihoods are iteratively propagated throughout the graphs. In each iteration, a seed image is selected and its likelihoods are propagated to the adjacent neighbors in Gi. In the first iteration, the seed image is always the user-segmented template image. In subsequent iterations, the seed image is randomly picked from the neighbors of the current seed. Within an iteration, the seed image likelihoods are considered fixed. Note that the template image likelihoods remain fixed throughout the whole algorithm.

The details of this propagation are described in the next section. The new estimates are first derived separately, according to Section 4.2, and are then jointly refined, according to Section 4.3. These new likelihood estimates are combined with previous estimates, where estimates are amortized along their propagation and get exponentially lower weights over time, as we have more confidence in beliefs that are closer to our template image.

After propagating the likelihoods, we update the foreground-background segmentation for the next seed using a modified implementation of graph-cuts [15], where the unary terms are initialized according to the obtained likelihoods.

The algorithm above is terminated once all images have been propagated to at least once. To avoid error accumulation, we execute the full pipeline multiple times (five in our implementation). The final results are obtained by averaging all the likelihood estimates, followed by a graph-cut segmentation.

Figure 4: Corresponding foreground parts of adjacent images in Gi according to p_corr only (top row) vs. our enriched compatibility measure (bottom row), which contains significantly more compatible parts. The foreground source is displayed on the left.
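The overall loop of Section 3.2 can be summarized in the schematic sketch below. The per-pair inference of Section 4.2, the joint refinement of Section 4.3, and the modified graph-cuts of [15] are passed in as hypothetical callables (infer_from_seed, joint_refine, graph_cut), and DECAY is an assumed value standing in for the exponential down-weighting of estimates that travelled further from the template.

    import random

    import numpy as np

    DECAY = 0.5   # assumed weight multiplier per propagation step
    N_RUNS = 5    # the full pipeline is executed five times in the paper


    def run_once(neighbors, template_id, template_lik,
                 infer_from_seed, joint_refine):
        """One pass over a connected Gi; returns per-image F-likelihoods."""
        lik = {template_id: template_lik}   # template beliefs stay fixed
        weight = {template_id: 1.0}
        seed, visited = template_id, {template_id}
        while len(visited) < len(neighbors):
            for target in neighbors[seed]:
                if target == template_id:
                    continue                 # never overwrite the template
                est = infer_from_seed(seed, target, lik[seed])
                w = DECAY * weight[seed]
                if target in lik:            # amortize with earlier estimates
                    tot = weight[target] + w
                    lik[target] = (weight[target] * lik[target]
                                   + w * est) / tot
                    weight[target] = tot
                else:
                    lik[target], weight[target] = est, w
                visited.add(target)
            joint_refine(neighbors[seed], lik)       # Section 4.3 step
            seed = random.choice(sorted(neighbors[seed]))  # next seed
        return lik


    def cosegment(neighbors, template_id, template_lik,
                  infer_from_seed, joint_refine, graph_cut):
        """Average N_RUNS independent passes, then cut (Section 3.2)."""
        runs = [run_once(neighbors, template_id, template_lik,
                         infer_from_seed, joint_refine)
                for _ in range(N_RUNS)]
        return {img: graph_cut(np.mean([r[img] for r in runs], axis=0))
                for img in neighbors}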
4 Likelihood inference and propagation

Our algorithm uses convex belief propagation and further extends the variational approximate inference program to include quadratic terms. Therefore, in Section 4.1, we briefly introduce notations used in later sections. In Section 4.2, we present an approach to infer an object likelihood map of a single target image from a seed image. Finally, in Section 4.3, we introduce a technique to propagate the likelihood maps across similar images to improve the accuracy and reliability of these inferred maps.

4.1 Convex belief propagation

Markov random fields (MRFs) consider joint distributions over discrete product spaces Y = Y_1 × ··· × Y_n. The joint probability is defined by combining potential functions over subsets of variables. Throughout this work we consider two types of potential functions: single variable functions, θ_i(y_i), which correspond to the n vertices in a graph, i ∈ {1,...,n}, and functions over pairs of variables, θ_{i,j}(y_i,y_j), that correspond to the graph edges, (i,j) ∈ E. The joint distribution is then given by the Gibbs probability model:

    p(y_1,...,y_n) ∝ exp( Σ_{i∈V} θ_i(y_i) + Σ_{(i,j)∈E} θ_{i,j}(y_i,y_j) ).    (1)

Many computer vision tasks require inferring various quantities from the Gibbs distribution, e.g., the marginal probability p(y_i) = Σ_{y\y_i} p(y_1,...,y_n).

Convex belief propagation [24, 11] is a message-passing algorithm that computes the optimal beliefs b_i(y_i) which approximate the Gibbs marginal probabilities. Furthermore, under certain conditions, these beliefs are precisely the Gibbs marginal probabilities p(y_i). For completeness, in the Supplementary Material, we define these conditions and explicitly describe the optimization program.
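For intuition, the sketch below evaluates the Gibbs model of Equation 1 by brute force on a toy three-node chain and computes the exact marginals p(y_i); these are the quantities that the beliefs b_i(y_i) of convex belief propagation approximate. The potential values are arbitrary examples, and the message-passing algorithm itself is omitted.

    import itertools

    import numpy as np

    labels = (-1, +1)          # binary labels, as in Section 4.2
    n = 3
    edges = [(0, 1), (1, 2)]   # a small chain graph

    # theta_i(y_i): example local potentials, row i = [theta(-1), theta(+1)]
    theta_local = np.array([[0.2, 1.0],    # node 0 leans foreground (+1)
                            [0.5, 0.5],    # node 1 is undecided
                            [1.0, 0.1]])   # node 2 leans background (-1)
    w = 0.8  # theta_ij(y_i, y_j) = w * y_i * y_j favors equal labels


    def score(assign):
        """Sum of local and pairwise potentials for one joint assignment."""
        s = sum(theta_local[i][(y + 1) // 2] for i, y in enumerate(assign))
        s += sum(w * assign[i] * assign[j] for i, j in edges)
        return s

    # Unnormalized Gibbs probabilities over all 2^n assignments.
    assignments = list(itertools.product(labels, repeat=n))
    p = np.array([np.exp(score(a)) for a in assignments])
    p /= p.sum()

    # Exact marginals p(y_i = +1): what CBP estimates as b_i(+1).
    for i in range(n):
        marg = sum(pi for pi, a in zip(p, assignments) if a[i] == +1)
        print(f"p(y_{i} = +1) = {marg:.3f}")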
4.2 Single target image inference

In the following we present the basic component of our method, which infers an object likelihood map of a target image from a seed image. We construct a Markov random field (MRF) on the parts of the target image and use convex belief propagation to infer the likelihood of these parts being labeled as foreground.

Each part can be labeled as either foreground or background, i.e., y_i ∈ {−1,+1}. First, we describe the local potentials θ_i(y_i) of each part, which describe the likelihood of the part to belong to the foreground or the background. Then, we describe the pairwise potentials θ_{i,j}(y_i,y_j), which account for the spatial relations between adjacent parts. We infer the foreground-background beliefs b_i(y_i) of the parts in the target image by executing the standard convex belief propagation algorithm [24, 11].

4.2.1 Local potentials

The local potentials θ_i(y_i) express the extent of agreement of a part with the foreground or background models. To define parts in the seed image, we use the technique of Arbeláez et al. [18] at multiple levels to obtain a large bag of candidate parts of different scales. Let i be a part in the target image, and s be a part in the source seed image. For each source part s, we compute its compatibility with a target part i, and denote it by p_comp(i,s).

To construct the foreground/background likelihood of each part i in the target image, we consider the F/B parts of the source seed, and set

    θ_i(f) = max_{s∈F} p_comp(i,s)   and   θ_i(b) = max_{s∈B} p_comp(i,s),

where f and b are the two labels that can be assigned to y_i. We define our compatibility measure as follows:

    p_comp(i,s) = p_corr(i,s) + δ · p_sim(i,s),    (2)

where δ is a balancing coefficient that controls the amount of enrichment of the available set of correspondences. The term p_corr(i,s) measures the fraction of pixels that are matched between parts i and s. This is measured according to

    p_corr(i,s) = N(i,s) |s|⁻¹,    (3)

where N(i,s) is the number of corresponding pixels, and |s| is the number of pixels in part s. As mentioned before, the matching is based on NRDC. The identified highly compatible parts are rather sparse, and thereby p_corr(i,s) is almost always zero for many source-target pairs. Nonetheless, we can exploit these sparse correspondences to discover new compatible parts with the term p_sim(i,s). See Figure 4 for an illustration of the compatible target parts in the foreground regions without (top row) and with (bottom row) our enrichment term p_sim(i,s). In practice, since the background does not necessarily appear in both source and target images, δ > 0 only for regions s ∈ F.

In these foreground regions, the term p_sim(i,s) aims at revealing a similarity between parts whose appearance and spatial displacement highly agree. Similarity in appearance, in our context, is measured according to the Bhattacharyya coefficient of the RGB histograms, following the method of [16]. In order for parts i and s to highly agree in appearance, we further require that i ∈ top-k(s), where the number of nearest neighbors k is set to three. To recognize parts whose spatial displacement agrees, we utilize the set of corresponding pixels in the foreground regions. We approximate the pixel values of the part corresponding to s according to the known correspondences. Formally, for each s ∈ F, let i(s) be the estimated corresponding region in the target. Thus, for a similarity between parts i and s, we require that i ∩ i(s) ≠ ∅.

To simplify computations, we assume i(s) to be a circle within the target image, which we compute according to the closest and farthest foreground correspondences. These two corresponding points define a relative scale between the two images. To compute the circle center, we compute the relative offset from the closest corresponding point (using the relative scale). The radius is determined according to the distance to the nearest corresponding point in the target. See Figure 5(a) for an illustration of the estimated corresponding region i(s).

Figure 5: (a) Based on two reliable correspondences (in blue), the relative offset (in red) to a part (light blue) defines the region where the corresponding part is expected (marked with a light blue circle). (b) The multi-scale parts of the source seed image are matched to the parts of the target image (on the right). The parts that yield maximum compatibility are highlighted in unique colors (corresponding to the highlighted parts of the target image on the right).

Putting it all together, p_sim(i,s) is measured according to

    p_sim(i,s) = Σ_{u=1}^{16³} √( hist(i)_u · hist(s)_u )   if i ∈ top-k(s) and i ∩ i(s) ≠ ∅,
    p_sim(i,s) = 0   otherwise,

where hist(·)_u denotes bin u of the RGB histogram of a part. In our experiments, we set δ = 0.1 for all the foreground regions. See Figure 5(b) for an illustration of the multi-scale source parts that obtained maximum compatibility with parts in the target image.

4.2.2 Pairwise potentials

The pairwise potential function θ_{i,j}(y_i,y_j) induces spatial consistency from the part generation process within the image. As previously mentioned, we obtain parts at multiple scales by thresholding the ultrametric contour map at varying levels λ_i (see Figure 6 for an illustration).

To measure spatial consistencies between adjacent parts in the target, we can compute how quickly two parts merge into one by examining the level λ_merge where the two parts become one. Hence, we define a pairwise relation between adjacent parts in each target image according to:

    θ^intra_{i,j}(y_i,y_j) = exp(−τ(λ_merge − λ_min)) · y_i y_j    (4)

where the parameter τ = 4 was set empirically, and the finest level we examine to measure the spatial consistencies is λ_min = 0.2 (a merge there would induce the strongest proximity between the parts). See the heat-maps in Figure 6 for an illustration of these pairwise potentials on a few randomly-chosen target parts.
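The sketch below spells out Equations 2-4 on toy inputs. The constants (δ = 0.1, τ = 4, λ_min = 0.2, and 16 bins per RGB channel) follow the text; the top-k membership and the overlap with the estimated region i(s) are passed in as precomputed flags, since they depend on the NRDC matches and the circle construction described above.

    import numpy as np

    DELTA, TAU, LAMBDA_MIN = 0.1, 4.0, 0.2


    def p_corr(n_matched, part_size):
        """Equation 3: fraction of pixels of source part s matched into i."""
        return n_matched / part_size


    def rgb_hist(pixels):
        """Normalized 16x16x16 RGB histogram of an (N, 3) uint8 array."""
        bins = (pixels // 16).astype(int)
        flat = bins[:, 0] * 256 + bins[:, 1] * 16 + bins[:, 2]
        h = np.bincount(flat, minlength=16 ** 3).astype(float)
        return h / h.sum()


    def p_sim(pix_i, pix_s, in_top_k, overlaps_region):
        """Bhattacharyya coefficient, gated by the two conditions above."""
        if not (in_top_k and overlaps_region):
            return 0.0
        return float(np.sum(np.sqrt(rgb_hist(pix_i) * rgb_hist(pix_s))))


    def p_comp(n_matched, part_size, pix_i, pix_s,
               in_top_k, overlaps_region, source_is_foreground):
        """Equation 2; the enrichment term is used only for s in F."""
        delta = DELTA if source_is_foreground else 0.0
        return (p_corr(n_matched, part_size)
                + delta * p_sim(pix_i, pix_s, in_top_k, overlaps_region))


    def theta_intra(lambda_merge, y_i, y_j):
        """Equation 4: parts that merge early in the UCM hierarchy exert
        a strong equal-label preference on each other."""
        return np.exp(-TAU * (lambda_merge - LAMBDA_MIN)) * y_i * y_j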
Figure 6: Given the image on the left, parts are obtained using hierarchical segmentation (panels: target image; target parts in different scales; visualizing the spatial consistencies). For each target image in the collection, we obtain a large number of parts, as in the left-most image. On the right are visualizations of various parts (colored in black) and the potentials which they induce on their neighboring parts. The neighboring parts are colored according to their proximities as expressed in Equation 4 (warm colors correspond to strong proximities, while cool colors correspond to weaker proximities).

4.3 Joint multi-target inference

In Section 4.2, we presented our approach to infer an object likelihood map from a seed image. In our setting, similar regions may co-occur across multiple images. Therefore, to improve the accuracy and reliability of the likelihood maps obtained by a single inference step, we propagate the inferred maps onto adjacent images in the image graph Gi. The output beliefs of each inferred target image are sent to its neighbors as a heat-map (i.e., a per-part foreground-background probability). Thus, our likelihood maps are complemented with joint inference across images.

To differentiate the different types of edges in Gp, we denote the edges that connect parts across images by Eb. A joint inference is encouraged by a pairwise potential function between matched parts in Eb. Since the labels satisfy y_i, y_j ∈ {−1,+1}, this can be done with the potential function

    θ^b_{i,j}(y_i,y_j) = ( p_corr(i,j) + p_corr(j,i) ) · y_i y_j    (5)

Simply stated, the local potentials propagate the output beliefs of one target as input potentials of its neighboring images. Our intuition is that the output beliefs b_i(y_i), which are concluded by running convex belief propagation within their image, serve as a source seed signal to the neighboring image. This introduces novel non-linearities to the propagation algorithm. To control these non-linearities, we extend the variational approximate inference program to include quadratic terms.

We also determine the conditions for which these quadratic terms still define a concave inference program, and prove that repeated convex belief propagation iterations across images achieve its global optimum. For more details on our extended variational approximate inference program, together with an empirical evaluation of our inference technique, please refer to the Supplementary Material, which can be found on the project website.
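The cross-image coupling and the reuse of output beliefs as input potentials can be sketched as follows; the quadratic extension of the variational program is developed in the Supplementary Material and is not reproduced here. The dictionary format for beliefs is illustrative.

    def theta_b(p_corr_ij, p_corr_ji, y_i, y_j):
        """Equation 5: symmetric cross-image pairwise potential that
        turns mutual correspondence overlap into an equal-label
        preference between matched parts i and j."""
        return (p_corr_ij + p_corr_ji) * y_i * y_j


    def beliefs_to_potentials(beliefs):
        """Reuse the output beliefs b_i(y_i) of an inferred image as
        local input potentials for its neighbor, as described above.

        `beliefs` maps part id -> foreground probability in [0, 1].
        """
        return {part: {+1: b, -1: 1.0 - b} for part, b in beliefs.items()}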
5 Results and evaluation

We empirically tested our algorithm on various datasets and compared our segmentation results to two state-of-the-art unsupervised co-segmentation techniques [7, 21] and to another, semi-supervised, co-segmentation technique [12]. The merit of comparing our weakly-supervised technique with unsupervised ones is twofold: first, it serves as a qualitative calibration of the results on the new co-located setting, and second, it clearly demonstrates the necessity of some minimal supervision to define the semantic foreground region.

For all three methods we used the original implementations by the authors, which are publicly available online. It should be noted that although [12] discuss an unsupervised approach as well, they only provide an implementation of the semi-supervised approach. To be compatible with our input, we provide their method with one input template mask. We measure the performance on different datasets, including benchmark datasets that were adapted for our novel problem setting. The full implementation of our method, along with the datasets that were used in the experiments, is available at our project website at: https://cs.tau.ac.il/~averbuch1/coseg/.

Space-time images. We evaluated our technique on various challenging space-time co-located image collections depicting various dynamic events. Some of them (BRIDE, SINGER, and BROADWAY) were downloaded from the internet, while others (TODDLER, BABY, SINGER WITH GUITARIST, and PERU) were casually captured by multiple photographers. These images contain repeated elements that do not necessarily belong to the semantic foreground region, and the appearance of the foreground varies greatly throughout the collections. We provide thumbnails for the full seven sets, together with results and comparisons, on our project website. Please refer to these results for assessing the quality of our results.

For a quantitative analysis, we manually annotated the foreground regions of three of our collections (BRIDE, TODDLER, and SINGER), and report the precision P (percentage of correctly labeled pixels) and Jaccard similarity J (intersection over union of the result and ground-truth segmentations) as in previous works (see Table 1). It should be noted that, to strengthen the evaluation, we perform three independent runs for the semi-supervised techniques, starting from different random seeds, and report the average scores.

Table 1: Comparison against co-segmentation techniques on an annotated subset of our space-time co-located collections.

              BRIDE         SINGER        TODDLER
              P     J       P     J       P     J
    [21]      47.3  16.6    28.9  16.6    49.6  25.7
    [7]       71.1  42.9    68.4  30.4    81.8  44.1
    [12]      63.9  27.4    88.8  63.4    66.7  38.9
    Ours      88.9  76.3    94.2  83.0    94.5  74.2

Figure 9 shows a sample of results, where the left-most image is provided as the template for the semi-supervised techniques. As can be observed from our results, the unsupervised co-segmentation techniques fail almost completely on our co-located collections. Regarding the semi-supervised technique, as Figure 9 demonstrates, when both the foreground and background regions highly resemble those of their counterparts in the given template, the results of [12] are somewhat comparable to ours. As soon as the backgrounds differ or there are additional models that were not in the template, their method includes many outliers, as can be seen in Figure 9. Unlike their method, we avoid defining strict global models that hold for all the images in the collection, and thus allow the flexibility that is required to deal with the variability across the collection.

Multiple foreground objects. We also compared our performance to [12] using their data. We use their main example, which also corresponds to our problem setting. The results are displayed in Figure 7, where we mark the multiple foreground objects in different colors. We execute our method multiple times with different seeds to match their input. As we can see here and in general, our method has fewer false positives and is more resistant to cluttered backgrounds. If we are able to spread our beliefs towards the target image, then we succeed in capturing the object rather well. Quantitatively, our technique cuts the precision error by more than half (from 6.89% down to 2.68%). However, if not enough confidence reaches the target image, then the object remains undetected, as can be observed in the uncolored basket of apples in the rightmost image.

Figure 7: Comparison to [12] on their dataset. The top row illustrates the ground-truth labellings of multiple foreground objects: apple basket (in pink), pumpkin (in orange), baby (in yellow) and two girls (in green and blue). The bottom rows illustrate their results (middle row) and ours (bottom row). On average, our method yields higher P scores (97.32% ≫ 93.11%) and comparable J scores (49.02% ∼ 49.61%).

Sampled video collections. The DAVIS dataset [17] is a recent benchmark for video segmentation techniques, containing 50 sequences that exhibit various challenges including occlusions and appearance changes. The dataset comes with per-frame, per-pixel ground-truth annotations. We sparsified these sequences (taking every 10th frame) to construct a large number of datasets that are somewhat related to our problem setting. Table 2 shows the intersection-over-union (IoU) scores on a representative subset and the average over all 50 collections. Similar to the input provided to video segmentation techniques in the mask propagation task, we also provide the semi-supervised techniques with a manual segmentation of the first frame. However, on our sparsified collections, subsequent frames are quite different, as illustrated in Figure 8.

Figure 8: Qualitative results of our technique on a sequence of sparse frames sampled from the DAVIS dataset [17], where the first frame is provided as template.
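For reference, the two scores used throughout this section can be computed from boolean masks as follows; this is a minimal sketch, where P is the percentage of correctly labeled pixels and J is the intersection-over-union of the predicted and true foreground (identical to the IoU reported on DAVIS).

    import numpy as np


    def precision(pred, gt):
        """P: percentage of pixels whose F/B label matches the ground truth."""
        return 100.0 * np.mean(pred == gt)


    def jaccard(pred, gt):
        """J: intersection over union of predicted and true foreground."""
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return 100.0 * inter / union if union else 100.0


    pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
    gt = np.array([[1, 0, 0], [0, 1, 0]], dtype=bool)
    print(precision(pred, gt), jaccard(pred, gt))  # ~83.3, ~66.7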
Table 2: IoU scores on the sparsified DAVIS benchmarks. Following previous work, the scores are provided on a representative subset, and the average is computed over all 50 sequences.

                          [21]    [7]     [12]    Ours
    bear                  0.19    0.05    0.73    0.92
    blackswan             0.30    0.07    0.74    0.64
    bmx-trees             0.04    0.06    0.08    0.27
    bmx-bumps             0.03    0.17    0.24    0.28
    breakdance-flare      0.10    0.08    0.25    0.08
    breakdance            0.09    0.09    0.50    0.29
    bus                   0.50    0.84    0.73    0.79
    dance-twirl           0.13    0.04    0.15    0.40
    libby                 0.38    0.14    0.29    0.41
    dog                   0.36    0.61    0.71    0.57
    drift-chicane         0.02    0.00    0.02    0.00
    drift-straight        0.13    0.31    0.11    0.26
    mallard-water         0.06    0.37    0.46    0.69
    mallard-fly           0.01    0.05    0.47    0.13
    elephant              0.13    0.00    0.28    0.45
    flamingo              0.18    0.23    0.43    0.64
    goat                  0.07    0.04    0.42    0.64
    hike                  0.17    0.00    0.36    0.89
    paragliding           0.30    0.13    0.80    0.82
    soccerball            0.02    0.00    0.37    0.67
    surf                  0.12    0.94    0.63    0.96
    Average               0.16    0.22    0.36    0.53

Our extensive evaluation on the adapted DAVIS benchmark clearly illustrates, first of all, the difficulty of the problem setting: the image structure is not temporally coherent, and unlike dense video techniques, we cannot benefit from any temporal priors. Furthermore, it demonstrates the robustness of our technique, as it achieves the highest scores on most of the datasets, as well as the highest average score over all 50 collections.

6 Conclusions and future work

In this work, we have presented a co-segmentation method that takes a distributed approach. Common co-segmentation methods gather information from all the images in the collection, analyze it globally, build a common model, and then infer the common foreground objects in all, or part of, the images. Here, there is no global model. The beliefs are propagated across the collection without forming a global model of the foreground object. Each image independently collects the beliefs from its neighbors, and consequently infers its own model for the foreground object.
Although our method is distributed, it currently relies on a seed model, which admittedly does not fully concur with the claim of having a distributed method. However, some supervision is necessarily required to define the semantic target model. Currently, it is provided as a single segmented image, but the seed model can possibly be provided in other forms.

We have shown that our approach outperforms state-of-the-art co-segmentation methods. However, as our results demonstrate, there are limitations, as the object cut-outs are imperfect. The entire object is not always inferred, and portions of the background may contaminate the extracted object. To alleviate these limitations, there are two possible avenues for future research: (i) at a high level, to better learn the semantics of the object, perhaps using data-driven approaches, e.g., convolutional networks, and (ii) at a low level, seeking better alternatives to graph-cuts and its inherent limitations.

In the future, we hope to explore our approach on massive collections, which may include thousands of photographs capturing interesting dynamic events; for example, a collection of images of a parade, where a 3D reconstruction is not applicable. The larger number of images is not just a quantitative difference, but a qualitative one as well, as the collection can become dense with stronger local connections. For such massive collections, the foreground object does not have to be a single object; we can propagate multi-target beliefs over the image network, as we demonstrated in our comparison to Kim and Xing [12]. Finally, the distributed nature of our method lends itself to parallel computation, which can be effective for large-scale collections.

Figure 9: Qualitative comparison to state-of-the-art co-segmentation techniques on our space-time collections. For both our technique and [12], the left-most image is provided as template. Please refer to our project website for an extensive and interactive comparison.

References

[1] A. Arpa, L. Ballan, R. Sukthankar, G. Taubin, M. Pollefeys, and R. Raskar. CrowdCam: Instantaneous navigation of crowd images using angled graph. In 3D Vision (3DV), 2013 International Conference on, pages 422–429. IEEE, 2013.
[2] T. Basha, Y. Moses, and S. Avidan. Photo sequencing. In Computer Vision – ECCV 2012, pages 654–667. Springer, 2012.
[3] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. iCoseg: Interactive co-segmentation with intelligent scribble guidance. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3169–3176. IEEE, 2010.
[4] N. D. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla. Automatic 3D object segmentation in multiple views using volumetric graph-cuts. Image and Vision Computing, 28(1):14–25, 2010.
[5] K.-Y. Chang, T.-L. Liu, and S.-H. Lai. From co-saliency to co-segmentation: An efficient and fully unsupervised energy minimization model. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2129–2136. IEEE, 2011.
[6] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):569–582, 2015.
[7] A. Faktor and M. Irani. Co-segmentation by composition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1297–1304, 2013.
[8] Q. Fan, F. Zhong, D. Lischinski, D. Cohen-Or, and B. Chen. JumpCut: non-successive mask transfer and interpolation for video cutout. ACM Transactions on Graphics (TOG), 34(6):195, 2015.
[9] Z. Gang and Q. Long. Silhouette extraction from multiple images of an unknown background. In Proceedings of the Asian Conference on Computer Vision. Citeseer, 2004.
[10] Y. HaCohen, E. Shechtman, D. B. Goldman, and D. Lischinski. Non-rigid dense correspondence with applications for image enhancement. ACM Transactions on Graphics (TOG), 30(4):70, 2011.
[11] T. Heskes. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26(1):153–190, 2006.
[12] G. Kim and E. P. Xing. On multiple foreground cosegmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 837–844. IEEE, 2012.
[13] G. Kim and E. P. Xing. Jointly aligning and segmenting multiple web photo streams for the inference of collective photo storylines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 620–627, 2013.
[14] G. Kim, E. P. Xing, L. Fei-Fei, and T. Kanade. Distributed cosegmentation via submodular optimization on anisotropic diffusion. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 169–176. IEEE, 2011.
[15] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In Computer Vision – ECCV 2012, pages 459–473. Springer, 2012.
[16] J. Ning, L. Zhang, D. Zhang, and C. Wu. Interactive image segmentation by maximal similarity based region merging. Pattern Recognition, 43(2):445–456, 2010.
[17] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.
[18] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, 2017.
[19] S. A. Ramakanth and R. V. Babu. SeamSeg: Video object segmentation using patch seams. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 376–383. IEEE, 2014.
[20] C. Rother, T. Minka, A. Blake, and V. Kolmogorov. Cosegmentation of image pairs by histogram matching – incorporating a global constraint into MRFs. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 993–1000. IEEE, 2006.
[21] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery and segmentation in internet images. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1939–1946. IEEE, 2013.
[22] J. C. Rubio, J. Serrat, A. López, and N. Paragios. Unsupervised co-segmentation through region matching. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 749–756. IEEE, 2012.
[23] S. Vicente, C. Rother, and V. Kolmogorov. Object cosegmentation. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2217–2224. IEEE, 2011.
[24] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.