Parametric Image Segmentation of Humans with Structural Shape Priors

Alin-Ionut Popa^2, Cristian Sminchisescu^{1,2}
1 Department of Mathematics, Faculty of Engineering, Lund University
2 Institute of Mathematics of The Romanian Academy
[email protected], [email protected]

arXiv:1501.06722v1 [cs.CV] 27 Jan 2015

Abstract

The figure-ground segmentation of humans in images captured in natural environments is an outstanding open problem due to the presence of complex backgrounds, articulation, varying body proportions, partial views and viewpoint changes. In this work we propose class-specific segmentation models that leverage parametric max-flow image segmentation and a large dataset of human shapes. Our contributions are as follows: (1) formulation of a sub-modular energy model that combines class-specific structural constraints and data-driven shape priors, within a parametric max-flow optimization methodology that systematically computes all breakpoints of the model in polynomial time; (2) design of a data-driven class-specific fusion methodology, based on matching against a large training set of exemplar human shapes (100,000 in our experiments), that allows the shape prior to be constructed on-the-fly, for arbitrary viewpoints and partial views; (3) demonstration of state of the art results, in two challenging datasets, H3D and MPII (where figure-ground segmentation annotations have been added by us), where we substantially improve on the first ranked hypothesis estimates of mid-level segmentation methods, by 20%, with hypothesis set sizes that are up to one order of magnitude smaller.

1. Introduction

Detecting and segmenting people in real-world environments are central problems with applications in indexing, surveillance, 3D reconstruction and action recognition. Prior work in 3D human pose reconstruction from monocular images [43, 22, 21], as well as more recent, successful RGB-D sensing systems based on Kinect [41], have shown that the availability of a figure-ground segmentation opens paths towards robust and scalable systems for human sensing. Despite substantial progress, the figure-ground segmentation in RGB images remains extremely challenging, because people are observed from a variety of viewpoints, have complex articulated skeletal structure, varying body proportions and clothing, and are often partially occluded by other people or objects in the scene. The complexity of the background further complicates matters, particularly as any limb decomposition of the human body leads to parts that are relatively regular but not sufficiently distinctive even when spatial connectivity constraints are enforced [47]. Setting aside appearance inhomogeneity and color variability due to clothing, which can overlap the background distribution significantly, it is well known that many of the generic, parallel line (ribbon) detectors designed to detect human limbs fire at high false positive rates in the background. This has motivated work towards detecting more distinctive part configurations, without restrictive assumptions on part visibility (e.g. full or upper view of the person), for which poselets [7] have been a successful example. However, besides relatively high false positive rates typical in detection, the transition from a bounding box of the person to a full segmentation of the human body is not straightforward. The challenge is to balance, on one hand, sufficient flexibility towards representing variability due to viewpoint, partial views and articulation, and, on the other hand, sufficient constraints in order to obtain segmentations that correspond to meaningful human shapes, all relying on region or structural human body part detectors that may only be partial or not always spatially accurate.

In this work we attempt to connect two relevant, recent lines of work, for the segmentation of people in real images. We rely on bottom-up figure-ground generation methods and region-level person classifiers in order to identify promising hypotheses for further processing. In a second pass we set up informed constraints towards (human) class-specific figure-ground segmentation by leveraging skeletal information and data-driven shape priors computed on the fly by matching region candidates against exemplars of a large, recently introduced human motion capture dataset containing 3D and 2D semantic skeleton information of people, as well as images and figure-ground masks from background subtraction (Human3.6M [23]). By exploiting globally optimal parametric max-flow energy minimization solvers, this time based on a class dependent (as opposed to generic and regular) foreground seeding process [18, 25, 13], we show that we can considerably improve the state of the art. To our knowledge this is one of the first formulations for class-specific segmentation that can handle multiple viewpoints and any partial view of the person, in principle. It is also one of the first to leverage a large dataset of human shapes, together with semantic structural information, which until recently have not been available. We show that such constraints are critical for accuracy, robustness, and computational efficiency.

1.1. Related Work

The literature on segmentation is huge, even when considered only under sub-categories like top-down (class-specific) and bottom-up segmentation. Humans are of considerable interest to be devoted special methodology, if that proves to be effective [28, 45, 19, 46, 42, 16, 3, 7, 48, 47, 49]. One approach is to consider shape as a category-specific property and integrate it within models that are driven by bottom-up processing [9, 6, 1, 27, 10, 29, 33, 7, 30, 38, 17]. Pishchulin et al. [38] develop pictorial structure formulations constrained by poselets, focusing on improving the response quality of an articulated part-based human model. The use of priors based on exemplars has also been explored, in a data-driven process. Both [40, 39] focus on a matching process in order to identify exemplars that correspond to similar scene or object layouts, then used in a graph cut process that enforces spatial smoothness and provides a global solution. Our approach is related to such methods, but we use a novel data-driven prior construction, enforce structural constraints adapted to humans, and search the state space exhaustively by means of parametric max-flow. In contrast to priors used in [40, 39], which require a more repeatable scene layout, we focus on a prior generation process that can handle a diverse set of viewpoints and arbitrary partial views, not known a-priori, and different across the detected instances.

Methods like [26] resemble ours in their reliance on a detection stage and the principle of matching that window representation against a training set where figure-ground segmentations are available, then optimizing an energy function based on graph-cuts. Our window representation contains additional detail and this makes it possible to match exemplars based on the semantic content identified. Our matching and shape prior construction are optimized for humans, in contrast to the generic ones used in [26] (which can however segment any object, not just people, as is our focus here^1). We use a large prior set of structurally annotated human shapes, and search the state space using a different, parametric multiple hypothesis scheme. Our prior construction uses, among other elements, a Procrustes alignment not unlike [20] but differently: (1) we use it for shape prior construction (input dependent, on the fly) within the energy optimizer as opposed to object detection (classification, construction per class) in [20], (2) we only use instances that align well with the query, reflecting accurate shape modeling, as opposed to fusing top-k instances to capture class variability in [20]. An alternative, interesting formulation for object segmentation with shape priors is branch-and-mincut [31], which proposes a branch and bound procedure in the compound space of binary segmentations and hierarchically organized shapes. However, the bounding process used for efficient search in shape space would rely on knowledge of the type of shapes expected and their full visibility. We focus on a different optimization and modeling approach that can handle arbitrary occlusion patterns of shape. Our prior constraint for optimization is generated on the fly by fusing the visible exemplar components, following a structural alignment scheme.

Recently there has been a resurrection of bottom-up segmentation methods based on multiple proposal generation, with surprisingly good results considering the low-level processing involved. Some of these methods generate segment hypotheses either by combining the superpixels of a hierarchical clustering method [4, 37, 44, 11], by varying the segmentation parameters [15] or by searching an energy model, parametrically, using graph cuts [13, 15, 24, 34, 36, 14]. Most of the latter techniques use mid-level shape priors for selection, either following hypothesis generation [13, 15, 34] or during the process. Some methods provide a ranking, diversification and compression of hypotheses, using e.g. Maximal Marginal Relevance (MMR) diversification [13, 15], whereas others report an unordered set [24, 34]. Hypothesis pool sizes in the order of 1,000-10,000 in the expansionary phase, and compressed models of 100-1,000 hypotheses following the application of trained rankers (operating on mid-level features extracted from segments) with diversification, are typical, with variance due to image complexity and edge structure. While prior work has shown that such hypothesis pools can contain remarkably good quality segments (60-80% intersection over union, IoU, scores are not uncommon), this leaves sufficient space for improvement, particularly since sooner or later one is inevitably facing the burden of decision making: selecting one hypothesis to report. It is then not uncommon for performance to sharply drop to 40%. This indicates that constraints and prior selection methods towards more compact, better quality hypothesis sets are necessary. Such issues are confronted in the current work.

^1 Notice however that the methodology we propose would be applicable to other objects than people. Here we focus on people because only for them, for now, large training sets of segmented shapes with structural annotations are available, through Human3.6M [23]. However as large datasets for other object categories emerge, we expect our methodology to generalize well. In this respect, our results on a challenging visual category, humans, are indicative of the performance bounds one can expect.

2. Methodology

We will consider an image as $I : V \to \mathbb{R}^3$, where $V$ represents the set of nodes, each associated with a pixel in the image, and the range is the associated intensity (RGB) vector. The image is modeled as a graph $G = (V, E)$. We partition the set of nodes in $V$ into two disjoint sets $V_f$ and $V_b$ which represent the assignments of pixels to foreground and background, respectively. $E$ is the subset of edges of the graph $G$ which reflects the connections between adjacent pixels. The formulation we propose will rely on object (or foreground) structural skeleton constraints obtained from person detection and 2D localization (in particular the identification of keypoints associated with the joints of the human body, and the resulting set of nodes corresponding to the human skeleton, obtained by connecting keypoints, $T \subseteq V$), as well as a data-driven, human shape fusion prior $S : V \to [0, 1]$, constructed ad-hoc by fusing similar configurations with the one detected, based on a large dataset of human shapes with associated 2D skeleton semantics (see §2.1 for details). The energy function defined over the graph $G$, with $X = \cup \{x_u\}$, is:

$$E_\lambda(X) = \sum_{u \in V} U_\lambda(x_u) + \sum_{(u,v) \in E} V_{uv}(x_u, x_v) \qquad (1)$$

where

$$U_\lambda(x_u) = D_\lambda(x_u) + S(x_u)$$

with $\lambda \in \mathbb{R}$, and unary potentials given by semantic foreground constraints $V_f \leftarrow T$:

$$D_\lambda(x_u) = \begin{cases} 0 & \text{if } x_u = 1,\ u \notin V_b \\ \infty & \text{if } x_u = 1,\ u \in V_b \\ \infty & \text{if } x_u = 0,\ u \in V_f \\ f(x_u) + \lambda & \text{if } x_u = 0,\ u \notin V_f \end{cases} \qquad (2)$$

The foreground bias is implemented as a cost incurred by the assignment of non-seed pixels to background, and consists of a pixel-dependent value $f(x_u)$ and a uniform offset $\lambda$. Two different functions $f(x_u)$ are used alternatively. The first is constant and equal to 0, resulting in a uniform (variable) foreground bias. The second function uses color. Specifically, RGB color distributions $p_f(x_u)$ on seed $V_f$ and $p_b(x_u)$ on seed $V_b$ are estimated and used to derive $f(x_u) = \ln \frac{p_f(x_u)}{p_b(x_u)}$. The probability of pixel $i$ belonging to the foreground is defined as $p_f(i) = \exp(-\gamma \cdot \min_j \|I(i) - I(j)\|)$, with $\gamma$ a scaling factor, where $j$ indexes representative pixels in the seed region, selected as centers resulting from a k-means algorithm ($k$ is set to 5 in all of our experiments). The background probability is defined similarly.

The pairwise term $V_{uv}$ penalizes the assignment of different labels to similar neighboring pixels:

$$V_{uv}(x_u, x_v) = \begin{cases} 0 & \text{if } x_u = x_v \\ g(u, v) & \text{if } x_u \neq x_v \end{cases} \qquad (3)$$

with similarity between adjacent pixels given by $g(u, v) = \exp\left[-\frac{\max(Gb(u), Gb(v))}{\sigma^2}\right]$. $Gb$ returns the output of the multi-cue contour detector [35, 32] at a pixel. The boundary sharpness parameter $\sigma$ controls the smoothness of the pairwise term.

The energy function defined by (1) is submodular and can be optimized using parametric max-flow, in order to obtain all breakpoints of $E_\lambda(X)$ as a function of $(\lambda, X)$ in polynomial time.

Figure 2. Processing steps of our segmentation methods based on Constrained Parametric Problem Dependent Cuts (CPDC) with Shape Matching, Alignment and Fusion (MAF).

Given the general formulation in (1) and (2), the key problems to address are: (a) the identification of a putative set of person regions and structural constraints hypotheses $T$; (b) the construction of an effective, yet flexible data-driven human shape prior $S$, based on a sufficiently diverse dataset of people shapes and skeletal structure, given estimates for $T$; (c) minimization of the resulting energy model (1). We address (a) without loss of generality, using a human region classifier (any other set of structural, problem dependent detectors can be used, here e.g. face and hand detectors based on skin color models or poselets). We address (b) using methodology that combines a large dataset of human pose shapes and body skeletons, collected from Human3.6M [23], with shape matching, alignment and fusion analysis, in order to construct the prior on the fly, for the instance being analyzed. We refer to a model that leverages both problem-dependent structural constraints $T$ and a data-driven shape prior $S$, in a single joint optimization problem, as Constrained Parametric Problem Dependent Cuts with Shape Matching, Alignment and Fusion (CPDC-MAF). The integration of bottom-up region detection constraints with a shape prior construction is described in §2.1. The CPDC-MAF model can be optimized in polynomial time using parametric max-flow, in order to obtain all breakpoints of the associated energy model (addressing (c)).

Figure 1. First row: Our Shape Matching Alignment Fusion (MAF) construction based on semantic matching, structural alignment and clipping, followed by fusion, to reflect the partial view. Notice that the prior construction allows us to match partial views of a putative human detected segment to fully visible exemplars in Human3.6M. This allows us to handle arbitrary patterns of occlusion. We can thus create a well adapted prior, on the fly, given a candidate segment. Second and third rows: Examples of segmentations obtained by several methods (including the proposed ones), with intersection over union (IoU) scores and ground truth shown. See fig. 5 for additional image segmentation results.

2.1. Data-Driven Shape Matching, Alignment and Fusion (MAF)

We aim to obtain an improved figure-ground segmentation for persons by combining bottom-up and top-down, class specific information. We initialize our proposal set using CPMC [13]. While any figure-ground segmentation proposal method can be employed, in principle, we chose CPMC due to its performance and because our method can be viewed as a generalization with problem dependent seeds and shape priors. We filter the top $N$ segment candidates using an O2P [12] region classifier trained to respond to humans, using examples from Human3.6M, to obtain $D = \{d_i = \{z_i, b_i\} \mid i = 1, \dots, N\}$. Each candidate segment is represented by a binary mask $z_i$ (1 stands for foreground and 0 stands for background) and a bounding box $b_i \in \mathbb{R}^4$ where $b = (m, n, w, h)$. $m$ and $n$ represent the image coordinates of the bottom left corner of the bounding box, $w$ and $h$ represent its width and its height.

We will use the set of human region candidates in order to match against a set of human shapes and construct a shape prior. There are challenges however, particularly being able to: (1) access a sufficiently representative set of human shapes to construct the prior, (2) be sufficiently flexible so that human shapes from the dataset, which are very different from the shape being analyzed, would not negatively impact estimates, (3) handle partial views: while we rely on bottom-up proposals that can handle partial views, the use, in contrast, of a shape prior that can only represent, e.g. full or upper-body views, would not be effective.

We address: (1) by employing a dataset of 100,000 human shapes together with the corresponding skeleton structure, sub-sampled from the recently created Human3.6M dataset [23]; (2) by employing a matching, alignment and fusion technique between the current segment and the individual exemplar shapes in the dataset. Shapes and structures which cannot be matched and aligned properly are discarded; (3) by leveraging the implicit correspondences available across training shapes, at the level of local shape matches, by only aligning and warping those components of the exemplar shapes that can be matched to the query, at the level of joints. A sample flow of our entire method can be visualized in figure 1 first row and figure 2.

Boundary Point Sampling: Given a bottom-up figure-ground proposal represented as a binary mask $z \in D$, we sample through the image coordinates of the boundary points of the foreground segment. Thus we obtain a set of 2D points $p_j$, $j = 1, \dots, K$ with $p_j \in \mathbb{R}^2$ where $p_j = (x_j, y_j)$. We loop through the shapes of our human shape dataset Human3.6M and for each shape we rotate and scale it so that it has the same orientation and scale as the foreground candidate segment and sample through its boundary points. Thus we obtain a set of 2D points $q_{jl}$, $j = 1, \dots, K$, with $l = 1, \dots, L$, where $L$ represents the number of poses in the shape-pose dataset, in our case $L = 100{,}000$.

Shape Matching and Transform Matrix: We employ the shape context descriptor [5] at each position $p_j$ from the candidate segment and each position $q_{jl}$ from each shape from the dataset. We evaluate a $\chi^2$ distance on the resulting descriptors to select the indexes $l$ with enough well-matched boundary points such that we could estimate an affine transform.

We apply a 2D Procrustes transform with 5 degrees of freedom (rotation, anisotropic scaling including reflections, and translation) on $q_l$ in order to align each shape in the dataset with the corresponding boundary points. This will result in a 3x3 transformation matrix $W_l$ and an error for the transform $e_l$ which represents the Euclidean distance between the boundary points $p_j$ and the Procrustes transformed ones, $W_l \cdot q_{lj}$, in the image plane.

Prior Shape Selection and Warping: In order to determine which prior shapes are relevant for the current detected query, we identify the subset of indexes in the dataset $T$ which correspond to transformation errors that are smaller than a given threshold $\epsilon$. Thus, we obtain the corresponding figure-ground masks $m_t$, $t \in T$. For each mask $m_t$ we select the coordinates of foreground pixels and warp them using the transform matrix computed using the 2D joint coordinates transformation. We apply the same procedure to the attached skeleton configuration of the corresponding mask. Thus, we obtain the coordinates of the foreground pixels for the transformed mask, $\Phi_t$, and the transformed skeleton coordinates $\Psi_t$.

Prior Shape Fusion: We compute the mean of the entire set of transformed masks, $\Phi_t$, thus obtaining a MAF prior $S$ corresponding to the detection $d$, as seen in figure 1, second row. The values of the shape prior mask range from 0 to 1, background and foreground probabilities, respectively. We also compute the mean of the entire set of transformed skeletons $\Psi_t$, thus obtaining a configuration of keypoints $B \in \mathbb{R}^{3 \times 15}$ with $B_j = (x, y, 1)$ where $x$ and $y$ represent the image coordinates of the warped joint from Human3.6M. This can be used to obtain a problem dependent mask $m$ as follows. Initially we set the mask to have the same dimension as the entire image, filled with 0. We use Bresenham's algorithm to draw a line between the semantically adjacent joints, for example: left elbow - left wrist, right hip - right knee, and so on. We assign the set of skeleton nodes to the foreground as $T = \{i \in V \mid m(i) = 1\}$. This entire procedure of obtaining the shape prior information (mask and skeleton) is illustrated in algorithm 1.

Algorithm 1 Calculate S and B (Shape Matching, Alignment and Fusion, MAF)
Require:
    $d_i = \{z, b\}$
    $d_l$, $l = 1, \dots, L$ - 2D joint positions (Human3.6M)
    $m_l$, $l = 1, \dots, L$ - figure-ground masks (Human3.6M)
    $L$ - number of poses (Human3.6M, use $L = 100{,}000$)
    $\epsilon$ - threshold value for transform error
    $f(\cdot)$ - shape context descriptor
    $\mu$ - threshold value for $\chi^2$ for shape context descriptors
Ensure: $S$, $B$
    Sample boundary points $p_j$, $j = 1, \dots, K$ on $z$
    for $l \in L$ do
        Sample $K$ boundary points $q_{jl}$, $j = 1, \dots, K$ on $m_l$
        $J = \{(x, y) \in \mathbb{N}^2 \mid \chi^2(f(q_{xl}), f(p_y)) < \mu\}$
        if $|J| > 2$ then
            $a_{jl}(W) = p_j - W \cdot q_{jl}$
            $W_l = \arg\min_W \frac{1}{|K|} \sum_{j \in K} a_{jl}(W)^\top a_{jl}(W)$
            $e_l = \frac{1}{|K|} \sum_{j \in K} a_{jl}(W_l)^\top a_{jl}(W_l)$
        else
            $e_l = \infty$
        end if
    end for
    $T = \{l \in L \mid e_l < \epsilon\}$
    for $t \in T$ do
        $V_f$ - foreground pixels of $m_t$, $V_b$ - background pixels of $m_t$, $V = V_b \cup V_f$
        for $u \in V$ do
            if $u \in V_f$ then
                $\Phi_t(W_t \cdot u) = 1$
            else
                $\Phi_t(W_t \cdot u) = 0$
            end if
        end for
        $\Psi_t = W_t \cdot d_t$
    end for
    $S = \frac{1}{|T|} \sum_{t \in T} \Phi_t$
    $B = \frac{1}{|T|} \sum_{t \in T} \Psi_t$

3. Experiments

We test our methodology on two challenging datasets: H3D [8] which contains 107 images and MPII [2] with 3799 images. In all cases we have figure-ground segmentation annotations available. For the MPII dataset, we generated figure-ground human segment annotations ourselves. Both the H3D and the MPII datasets contain both full and partial views of persons and self-occlusion and are extremely challenging.

We run several segmentation algorithms including CPMC [13] as well as our proposed CPDC-MAF where we use bottom-up person region detectors trained on Human3.6M and using region descriptors based on O2P [12]. We also constructed a model referred to as CPDC-MAF-POSELETS, built using problem dependent seeds based on a 2D pose detector instead of proposed segments from a figure-ground segmentation algorithm. While any methodology that provides body keypoints (parts or articulations) is applicable, we chose the poselet detector because it provides results under partial views of the body, or self occlusions of certain joints, together with joint position estimates. Conditioned on a detection, we apply the same idea as in our CPDC-MAF, except that we use the detected skeletal keypoints to match against the exemplars in the Human3.6M dataset. A matching process based on semantic keywords (the body joints) is explicit, immediate (since joints are available both for the putative poselet detector and for the exemplar shapes in Human3.6M) and arguably simpler than matching shapes in the absence of skeletal information. The downside is that when the poselet detection is incorrect, the matching will also be (notice that alignments with high score following matching are nevertheless discarded within the MAF process).

For CPDC-MAF, we initialize, bottom-up, by using candidate segments from the CPMC pool, selected based on their person ranking score after applying the O2P classifier. This is followed by a non-maximum suppression step where we remove the pair of segments with an overlap above 0.25. We use the MAF process to reject irrelevant candidates and to build shape prior masks and skeleton configuration seeds for the segments with good matching produced by shape context descriptors. On each resulting shape prior and skeleton seeds we run the CPDC-MAF model, with the resulting pools from each candidate segment merged to obtain the human region proposals for an entire image.

For each testing setup, we report the mean values (computed over the entire testing dataset) of the intersection over union (IoU) scores for the first segment in the ranked pool and the ground-truth figure-ground segmentation for each image. We also report the mean values of the IoU scores for the pool segment with the best IoU score with the ground-truth figure-ground segmentation.

Results for different datasets can be visualized in table 1. In turn, figures 3, 4 show plots for the size of the segment pools and IoU scores for highest ranked segments generated by different methods, with image indexes sorted according to the best performing method (CPDC-MAF). Qualitative segmentation results for the various methods tested are given in figure 5.

H3D Test Set [8]
Method                 First   Best   Pool size
CPMC [13]              0.54    0.72   783
CPDC-MAF               0.60    0.72   77
CPDC-MAF-POSELETS      0.53    0.6    98

MPII Test Set [2]
Method                 First   Best   Pool size
CPMC [13]              0.29    0.73   686
CPDC-MAF               0.55    0.71   102
CPDC-MAF-POSELETS      0.43    0.58   114

Table 1. Accuracy and pool size statistics for different methods, on data from H3D and MPII. We report average IoU over test set for the first segment of the ranked pool and the ground-truth figure-ground segmentation (First), the average IoU over test set of the segment with the highest IoU with the ground-truth figure-ground segmentation (Best) and average pool size (Pool Size).

Figure 3. Dimension of segmentation pool for MPII and various methods along with average pool size (in legend). Notice the significant difference between the pool size values of CPDC-MAF-POSELETS and CPDC-MAF compared to the ones of CPMC. CPMC pool size values maintain an average of 700 units, whereas the pool sizes of CPDC-MAF and CPDC-MAF-POSELETS are considerably smaller, around 100 units.

Figure 4. IoU for the first segment from the ranked pool in MPII. The values for CPMC and CPDC-MAF-POSELETS have higher variance compared to CPDC-MAF, resulting in the performance drop illustrated by their average.

Figure 5. Segmentation examples for various methods. From left to right, original image, CPMC with default settings on person's bounding box, CPDC-MAF-POSELET and CPDC-MAF. See also table 1 for quantitative results. Please check our supplementary material for additional image results.

4. Conclusions

We have presented class-specific image segmentation models that leverage human body part detectors based on bottom-up figure-ground proposals, parametric max-flow solvers, and a large dataset of human shapes. Our formulation leads to a sub-modular energy model that combines class-specific structural constraints and data-driven shape priors, within a parametric max-flow optimization methodology that systematically computes all breakpoints of the model in polynomial time. We also propose a data-driven class-specific prior fusion methodology, based on shape matching, alignment and fusion, that allows the shape prior to be constructed on-the-fly, for arbitrary viewpoints and partial views. We demonstrate state of the art results in two challenging datasets: H3D [8] and MPII [2], where we improve the first ranked hypothesis estimates of mid-level segmentation methods by 20%, with pool sizes that are up to one order of magnitude smaller. In future work we will explore additional class-dependent seed generation mechanisms and plan to study the extension of the proposed framework to video.

Acknowledgements

This work was supported in part by CNCS-UEFISCDI under CT-ERC-2012-1 and PCE-2011-3-0438.

References

[1] S. Alpert, M. Galun, R. Basri, and A. Brandt. Image segmentation by probabilistic bottom-up aggregation and cue integration. In CVPR, 2007.
[2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, June 2014.
[3] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
[4] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 2010.
[5] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24(4):509–522, 2002.
[6] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In ECCV, pages 109–122. 2002.
[7] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting People Using Mutually Consistent Poselet Activations. 2010.
[8] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, sep 2009.
[9] H. Boussaid and I. Kokkinos. Fast and exact: Admm-based discriminative shape segmentation with loopy part models. In CVPR, pages 4058–4065, 2014.
[10] M. Bray, P. Kohli, and P. H. S. Torr. Posecut: Simultaneous segmentation and 3d pose estimation of humans using dynamic graph-cuts. In ECCV, pages 642–655, May 2006.
[11] T. Brox, L. Bourdev, S. Maji, and J. Malik. Object segmentation by alignment of poselet activations to image contours. In CVPR, Jun. 2011.
[12] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[13] J. Carreira and C. Sminchisescu. CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts. PAMI, 2012.
[14] J. Dong, Q. Chen, S. Yan, and A. L. Yuille. Towards unified object detection and semantic segmentation. In ECCV, pages 299–314, 2014.
[15] I. Endres and A. Hoiem. Category independent object proposals. In ECCV, September 2010.
[16] V. Ferrari, M. Marin, and A. Zisserman. Pose Search: retrieving people using their pose. In CVPR, 2009.
[17] F. Flohr and D. M. Gavrila. Pedcut: an iterative framework for pedestrian segmentation combining shape models and multiple data cues. In BMVC, September 2013.
[18] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM J. Comput., 18(1):30–55, 1989.
[19] G. Ghiasi, Y. Yang, D. Ramanan, and C. C. Fowlkes. Parsing occluded people. In CVPR, pages 2401–2408, 2014.
[20] C. Gu, P. A. Arbeláez, Y. Lin, K. Yu, and J. Malik. Multi-component models for object detection. In ECCV, pages 445–458, 2012.
[21] C. Ionescu, J. Carreira, and C. Sminchisescu. Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation. In CVPR, June 2014.
[22] C. Ionescu, F. Li, and C. Sminchisescu. Latent Structured Models for Human Pose Estimation. In ICCV, 2011.
[23] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. PAMI, 2014.
[24] J. Kim and K. Grauman. Shape sharing for object segmentation. In ECCV. 2012.
[25] V. Kolmogorov, Y. Boykov, and C. Rother. Applications of parametric maxflow in computer vision. ICCV, 2007.
[26] D. Kuettel and V. Ferrari. Figure-ground segmentation by transferring window masks. In CVPR, 2012.
[27] M. P. Kumar, P. Torr, and A. Zisserman. Objcut: Efficient segmentation using top-down and bottom-up cues. PAMI, 2010.
[28] L. Ladicky, P. H. S. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR, pages 3578–3585, 2013.
[29] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. IJCV, 77(1-3):259–289, 2008.
[30] V. Lempitsky, A. Blake, and C. Rother. Image segmentation by branch-and-mincut. In ECCV, pages 15–29. Springer, 2008.
[31] V. Lempitsky, A. Blake, and C. Rother. Image segmentation by branch-and-mincut. In ECCV, pages IV:15–29, October 2008.
[32] M. Leordeanu, R. Sukthankar, and C. Sminchisescu. Efficient Closed-Form Solution to Generalized Boundary Detection. In ECCV, October 2012.
[33] A. Levin and Y. Weiss. Learning to combine bottom-up and top-down segmentation. In ECCV, pages 581–594. 2006.
[34] A. Levinshtein, C. Sminchisescu, and S. Dickinson. Optimal contour closure by superpixel grouping. In ECCV, September 2010.
[35] M. Maire, P. Arbelaez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. CVPR, 0:1–8, June 2008.
[36] M. Maire, S. X. Yu, and P. Perona. Object detection and segmentation from joint embedding of parts and pixels. In ICCV, 2011.
[37] T. Malisiewicz and A. Efros. Improving spatial support for objects via multiple segmentations. BMVC, September 2007.
[38] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In CVPR, June 2013.
[39] A. Rosenfeld and D. Weinshall. Extracting foreground masks towards object recognition. In ICCV, 2011.
[40] B. C. Russell, A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Segmenting scenes by matching image composites. 2009.
[41] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. In CVPR, 2011.
[42] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In CVPR, pages 2041–2048, 2006.
[43] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In CVPR, 2008.
[44] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
[45] H. Wang and D. Koller. Multi-level inference by relaxed dual decomposition for human pose segmentation. In CVPR, pages 2433–2440, 2011.
[46] W. Xia, Z. Song, J. Feng, L. F. Cheong, and S. Yan. Segmentation over detection by coupled global and local sparse representations. In ECCV, pages 662–675, 2012.
[47] Y. Yang and D. Ramanan. Articulated Human Detection with Flexible Mixtures of Parts. PAMI, 2013.
[48] S. Zuffi, O. Freifeld, and M. J. Black. From pictorial structures to deformable structures. In CVPR, pages 3546–3553. IEEE, June 2012.
[49] S. Zuffi, J. Romero, C. Schmid, and M. J. Black. Estimating human pose with flowing puppets. 2013.
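As an illustration of the energy in equations (1)-(3) of the Methodology section, the following is a minimal sketch that evaluates E_λ(X) for one given binary labeling on a 4-connected pixel grid. It does not perform parametric max-flow. The function names and array layout are hypothetical, and the way the shape prior S enters the unary term (a cost equal to S charged to pixels labeled background) is an assumption made here for illustration, since the text only states U = D + S.

```python
import numpy as np

def pairwise_weight(gb_u, gb_v, sigma):
    """g(u, v) = exp(-max(Gb(u), Gb(v)) / sigma^2), the similarity used in Eq. (3)."""
    return np.exp(-np.maximum(gb_u, gb_v) / sigma ** 2)

def energy(x, f, S, gb, lam, sigma, fg_seed, bg_seed):
    """Evaluate E_lambda(X) of Eq. (1) for a labeling x (1 = foreground).

    x, fg_seed, bg_seed: boolean arrays of shape (H, W)
    f:  pixel-dependent foreground bias, shape (H, W)
    S:  shape prior in [0, 1], shape (H, W)
    gb: contour detector response, shape (H, W)
    """
    x = x.astype(bool)
    # Hard constraints of Eq. (2): a seed pixel with the wrong label costs infinity.
    if np.any(x & bg_seed) or np.any(~x & fg_seed):
        return np.inf
    # Non-seed pixels assigned to background pay f + lambda.
    data = np.sum((f + lam)[~x & ~fg_seed])
    # ASSUMPTION: the prior penalizes labeling a likely-foreground pixel as background.
    prior = np.sum(S[~x])
    # Pairwise term of Eq. (3): pay g(u, v) wherever 4-connected neighbors disagree.
    h_cut = x[:, 1:] != x[:, :-1]
    v_cut = x[1:, :] != x[:-1, :]
    g_h = pairwise_weight(gb[:, 1:], gb[:, :-1], sigma)
    g_v = pairwise_weight(gb[1:, :], gb[:-1, :], sigma)
    smooth = np.sum(g_h[h_cut]) + np.sum(g_v[v_cut])
    return float(data + prior + smooth)
```

Sweeping `lam` and minimizing this energy for each value is what the parametric max-flow solver does jointly for all breakpoints; the sketch only scores a single candidate labeling.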
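The alignment, selection and fusion steps of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: boundary points are assumed to be already in one-to-one correspondence (the shape context matching step is skipped), and an unconstrained least-squares affine fit is swapped in for the 5 degree-of-freedom Procrustes transform described in the text. All function names are hypothetical.

```python
import numpy as np

def fit_affine(q, p):
    """Least-squares 3x3 matrix W mapping points q (K, 2) onto p (K, 2).

    Returns W and the mean squared residual, playing the role of e_l.
    """
    K = q.shape[0]
    Q = np.hstack([q, np.ones((K, 1))])            # homogeneous coordinates, (K, 3)
    A, *_ = np.linalg.lstsq(Q, p, rcond=None)      # (3, 2) affine parameters
    W = np.eye(3)
    W[:2, :] = A.T
    residual = p - Q @ A
    return W, float(np.mean(np.sum(residual ** 2, axis=1)))

def warp_mask(mask, W, out_shape):
    """Forward-warp the foreground pixels of a binary mask with W (may leave holes)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys, np.ones_like(xs)]).astype(float)   # (3, n)
    wx, wy, _ = W @ pts
    wx, wy = np.rint(wx).astype(int), np.rint(wy).astype(int)
    keep = (wx >= 0) & (wx < out_shape[1]) & (wy >= 0) & (wy < out_shape[0])
    out = np.zeros(out_shape, dtype=float)
    out[wy[keep], wx[keep]] = 1.0
    return out

def maf_prior(p, exemplars, eps, out_shape):
    """Fuse exemplars whose alignment error is below eps.

    exemplars: iterable of (boundary points (K, 2), mask, joints (n, 2)).
    Returns (S, B): mean warped mask and mean warped joints (3, n), or (None, None).
    """
    masks, skeletons = [], []
    for q, mask, joints in exemplars:
        W, err = fit_affine(q, p)
        if err < eps:                               # selection step, T = {l | e_l < eps}
            masks.append(warp_mask(mask, W, out_shape))
            J = np.hstack([joints, np.ones((joints.shape[0], 1))])
            skeletons.append(W @ J.T)
    if not masks:
        return None, None
    return np.mean(masks, axis=0), np.mean(skeletons, axis=0)
```

Discarding exemplars above the threshold, rather than averaging the top-k, mirrors the design choice stated in the Related Work discussion of [20].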
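The skeleton seed construction described under Prior Shape Fusion (drawing lines between semantically adjacent joints with Bresenham's algorithm, then taking the set pixels as T) can be sketched as below. The joint names and the limb list passed in are placeholders for illustration; the text does not enumerate the full set of 15 joints or their adjacency.

```python
import numpy as np

def bresenham(x0, y0, x1, y1):
    """Integer pixel coordinates on the line from (x0, y0) to (x1, y1), inclusive."""
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return points

def skeleton_mask(joints, limbs, shape):
    """Rasterize limbs into a zero-initialized mask of the image size.

    joints: dict name -> (x, y) integer image coordinates (hypothetical names)
    limbs:  list of (name_a, name_b) pairs of semantically adjacent joints
    """
    mask = np.zeros(shape, dtype=np.uint8)
    for a, b in limbs:
        for x, y in bresenham(*joints[a], *joints[b]):
            if 0 <= y < shape[0] and 0 <= x < shape[1]:   # clip to the image
                mask[y, x] = 1
    return mask
```

The foreground seed set is then the set of pixels where the returned mask equals 1.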
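The evaluation protocol of the Experiments section (IoU of the first ranked segment, IoU of the best segment in the pool) and the overlap-based suppression of candidates can be sketched as follows. The suppression rule shown is standard greedy non-maximum suppression, which is an assumption about how the 0.25 overlap threshold is applied; the function names are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / union if union else 0.0

def nms(masks, scores, max_overlap=0.25):
    """Greedy suppression: keep a segment only if its IoU with every
    higher-scoring kept segment is at most max_overlap."""
    order = np.argsort(scores)[::-1]
    kept = []
    for i in order:
        if all(iou(masks[i], masks[k]) <= max_overlap for k in kept):
            kept.append(int(i))
    return kept

def first_and_best(pool, ground_truth):
    """IoU of the first ranked segment and of the best segment in the pool
    (the First and Best columns of Table 1, before averaging over images)."""
    scores = [iou(m, ground_truth) for m in pool]
    return scores[0], max(scores)
```

Averaging the two returned values over all test images gives the per-dataset statistics reported in Table 1.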