ebook img

Causal graph-based video segmentation PDF

4 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Causal graph-based video segmentation

Causal graph-based video segmentation CamilleCouprie Cle´mentFarabet YannLeCun CourantInstituteofMathematicalSciences NewYorkUniversity,NewYork,NY10003,USA 3 Abstract to be temporally consistent, thus helping to achieve better 1 0 results.Sincemanyoftheunderlyingalgorithmsareingen- 2 Numerous approaches in image processing and com- eralsuper-linear,thereisoftenaneedtoreducethedimen- puter vision are making use of super-pixels as a pre- sionality of the video. To this end, developing low level n a processing step. Among the different methods producing vision methods for video segmentation is necessary. Cur- J such over-segmentation of an image, the graph-based ap- rently, most video processing approaches are non-causal, 8 proach of Felzenszwalb and Huttenlocher is broadly em- thatistosay, theymakeuseoffutureframestosegmenta ployed. Oneofitsinterestingpropertiesisthattheregions givenframe,sometimesrequiringtheentirevideo[9]. This ] V arecomputedinagreedymannerinquasi-lineartime. The preventstheiruseforreal-timeapplications. C algorithm may be trivially extended to video segmentation Some approaches have been designed to address the by considering a video as a 3D volume, however, this can causalvideosegmentationproblem[15,13].[15]makesuse . s not be the case for causal segmentation, when subsequent ofthemeanshiftmethod[2].Asthismethodworksinafea- c [ framesareunknown. Weproposeanefficientvideosegmen- turespace,itdoesnotnecessaryclusterspatiallyconsistent tationapproachthatcomputestemporallyconsistentpixels super-pixels. Amorerecentapproach, specificallyapplied 1 inacausalmanner,fillingtheneedforcausalandrealtime forsemanticsegmentation,istheoneofMiksiketal. [13]. v 1 applications. The work of [13] is employing an optical flow method to 7 enforcethetemporalconsistencyofthesemanticsegmenta- 6 tion. Our approach is different because it aims to produce 1 1.Introduction super-pixels, and possibly uses the produced super-pixels . 1 forsmoothingsemanticsegmentationresults. Furthermore, Asegmentationofvideointoconsistentspatio-temporal 0 wedonotuseanyopticalflowpre-computationthatwould 3 segments is a largely unsolved problem. While there have preventushavingrealtimeperformancesonaCPU. 1 beenattemptsatvideosegmentation,mostmethodsarenon : causalandnonreal-time.Thispaperproposesafastmethod Some works use the idea of enforcing some consis- v i for real time video segmentation, including semantic seg- tency between different segmentations [7, 12, 11, 8]. [7] X mentationasanapplication. formulates a co-clustering problem as a Quadratic Semi- r An large number of approaches in computer vision Assignment Problem. However solving the problem for a a makesuseofsuper-pixelsatsomepointintheprocess. For pairofimagestakesaboutaminute. Alternatively,[12]and example,semanticsegmentation[5],geometriccontextin- [8] identify the corresponding regions using graph match- dentification [10], extraction of support relations between ing techniques. The approach of [8] is only illustrated on object in scenes [14], etc. Among the most popular ap- verycoarseimagesegmentation,andthegraphmatchingis proachforsuper-pixelsegmentation,twotypesofmethods performedbetweengraphsthatcontainadozenofregions. aredistinguishable.Regularshapesuper-pixelsmaybepro- In both cases, the number of tracked regions is limited in ducedusingnormalizedcuts[16]forinstance. Moreobject theexperimentstoasmallamount. – or part of object – shaped super-pixels can be generated Theideadeveloppedinthispaperistoperformindepen- fromwatershedbasedapproaches. Inparticularthemethod dentsegmentationsandmatchtheproducedsuper-pixelsto ofFelzenswalbandHuttenlocher[6]producessuchresults. define markers. The markers are then used to produce the It is a real challenge to obtain a decent delineation of finalsegmentationbyminimizingaglobalcriteriondefined objects from a single image. When it comes to real-time ontheimage. Thegraphused intheindependentsegmen- data analysis, the problem is even more difficult. How- tationpartisreusedinthefinalsegmentationstage,leading ever, additional cues can be used to constrain the solution thustogainsinspeed,andreal-timeperformancesonasin- 1 glecoreCPU. 2.Method GivenasegmentationS ofanimageattimet,wewish t tocomputeasegmentationS oftheimageattimet+1 t+1 whichisconsistentwiththesegmentsoftheresultattimet. 2.1.Independentimagesegmentation The super-pixels produced by [6] have been shown to satisfytheglobalpropertiesofbeingnottoocoarseandnot too fine according to a particular region comparison func- tion. Inordertogeneratesuperpixelsclosetotheonespro- duced by [6], we first generate independent segmentations Figure1.Illustrationofthegraphmatchingprocedure ofthe2Dimagesusing[6]. Wenamethesesegmentations S(cid:48),...,S(cid:48). The principle of segmentation is fairly simple. 1 t Theedgesweightsbetweenvertexi∈V andj ∈V(cid:48) are We define a graph G , where the nodes correspond to the t t+1 t given by a similarity measure taking into account distance image pixels, and the edges link neighboring nodes in 8- anddifferencesbetweenshapeandappearance connectivity. Theedgeweightsω betweennodesiandj ij aregivenbyacolorgradientoftheimage. (|r |+|r |)d(c ,c ) AMinimumSpanningTree(MST)isbuildonG ,anda w = i j i j +a , (2) t ij |r ∩r | ij regionmergingcriteriaisdefined. Specifically,tworegions i j X andY aremergedwhen: where |r | denotes the number of pixels of region r , i i (cid:18) (cid:19) |r ∩ r | the number of pixels present in r and r with k k i j i j Diff(X,Y)≤min Int(X)+ ,Int(Y)+ , (1) aligned centroids, and a the appearance difference of re- |X| |Y| ij gionsr andr . Inourexperimentsa wasdefinedasthe i j ij wherek isaparameterallowingtopreventthemergingof differencebetweenmeancolorintensitiesoftheregions. large regions. The internal difference Int(X) of a region ThegraphmatchingprocedureisillustratedinFigure1 X is the highest weight of an edge linking two vertices of andproducesthefollowingresult: ForeachregionofS(cid:48) , t+1 X in the MST. The difference Diff(X,Y) between two itsbestcorrespondingregioninimageS isidentified.More t neighboring regions X and Y is the smallest weight of an specifically, each node i of V is associated with the node t edgethatlinksX toY. j of V(cid:48) which minimizes w . Symmetrically, for each t+1 ij regionofS ,itsbestcorrespondingregioninimageS(cid:48) is t t+1 Onceanimageisindependentlysegmented,resultingin identified, that is to say each node i of V(cid:48) is associated t+1 St(cid:48)+1, we then face the question of the propagation of the withthenodej ofVtwhichminimizeswij. temporalconsistencygiventhenonoverlappingcontoursof S andS(cid:48) . 2.3.Finalsegmentationprocedure t t+1 Oursolutionisthedevelopmentofacheapgraphmatch- The final segmentation S is computed using a min- t+1 ingtechniquetoobtaincorrespondencesbetweensegments imum spanning forest procedure. This seeded segmenta- from S and these of S(cid:48) . This first step is described in t t+1 tionalgorithmthatproduceswatershedcuts[4]isstrongly Section 2.2. We then mine these correspondences to cre- linkedtoglobalenergyoptimizationmethodssuchasgraph- atemarkers(alsocalledseeds)tocomputethefinallabeling cuts[1,3]asdetailedinSection2.4. Inadditiontotheoreti- S bysolvingaglobaloptimizationproblem.Thissecond t+1 calguarantiesofoptimality,thischoiceofalgorithmismo- stepisdetailedinSection2.3. tivatedbytheopportunitytoreusethesortingofedgesthat 2.2.Graphmatchingprocedure isperformedin2.1andconstitutesthemaincomputational effort. Consequently, wereuseherethegraphG (V,E) t+1 The basic idea is to use the segmentation St and seg- builtfortheproductionofindependentsegmentationS(cid:48) . mentationS(cid:48) toproducemarkersbeforeafinalsegmen- t+1 t+1 The minimum spanning forest algorithm is recalled in tation of image at time t+1. Therefore, in the process of Algorithm 1. The seeds, or markers, are defined using the computinganewsegmentationS ,agraphGisdefined. t+1 regionscorrespondencescomputedintheprevioussection, TheverticesofGcomprisestotwosetsofvertices: V that t according to the procedure detailed below. For each seg- corresponds to the set of regions of St and Vt(cid:48)+1 that cor- ments(cid:48)ofS(cid:48) fourcasesmayappear: responds to the set of regions of S(cid:48) . Edges link regions t+1 t+1 characterised by a small distance between their centroids. 1. s(cid:48)hasoneandonlyonematchingregionsinS :prop- t 2 Algorithm1: MinimumSpanningForestalgorithm Data: AweightedgraphG(V,E)andasetoflabeled nodesmakersL. NodesofV \Lhaveunknown labelsinitially. Result: Alabelingxassociatingalabeltoeachvertex. SorttheedgesofE byincreasingorderofweight. whileanynodehasanunknownlabeldo Findtheedgee inE ofminimalweight; ij ifv orv haveunknownlabelvaluesthen i j Mergev andv intoasinglenode,suchthat i j whenthevalueforthismergednodebecomes known,allmergednodesareassignedthe samevalueofxandconsideredknown. agate the label l of region s. All nodes of s(cid:48) are la- s beledwiththelabell ofregions. s 2. s(cid:48) has several corresponding regions s ,...,s : prop- 1 r agate seeds from S . The coordinates of regions t s ,...,s are centered on region s(cid:48). The labels of re- 1 r gions s ,...,s whose coordinates are in the range of Independentsegmentations Temporallyconsistentsegm- 1 r s(cid:48)arepropagatedtothenodesofs(cid:48). S1(cid:48),S2(cid:48) andS3(cid:48) entationsS1(=S1(cid:48)),S2,andS3 3. s(cid:48) has no matching region : The region is labeled by Figure 2. Segmentation results on 3 consecutive frames of the thelabell(cid:48) itself. NYU-Scenedataset. s 4. Ifnoneofthepreviouscasesisfulfilled,itmeansthat forest.Thereciprocalisalsotrueiftheweightsofthegraph s(cid:48) ispartofalargerregionsinS . Ifthesizeofs(cid:48) is t arealldifferent. small,anewlabeliscreated. Otherwise,thelabell is s Inthecaseofourapplication,thepairwiseweightsw is propagatedins(cid:48)asincase1. ij givenbyaninversefunctionoftheoriginalweightsω .The ij pairwisetermthuspenalizesanyunwantedhigh-frequency Beforeapplyingtheminimumspanningforestalgorithm, contentinxandessentiallyforcesxtovarysmoothlywithin asafetytestisperformedtocheckthatthemapofproduced an object, while allowing large changes across the object markers does not differ two much from the segmentation S(cid:48) . Ifthetestshowslargedifferences,anerodedmapof boundaries. The second term enforces fidelity of x to a t+1 S(cid:48) isusedtocorrectthemarkers. specified configuration l, wi being the unary weights en- t+1 forcingthatfidelity. 2.4.Globaloptimizationguaranties Theenforcementofmarkersl ashardconstrainedmay s be viewed as follows: A node of label l is added to the Several graph-based segmentation problems, including s graph,andlinkedtoallnodesiofV thataresupposedtobe minimum spanning forests, graph cuts, random walks and marked. The unary weights ω are set to arbitrary large shortestpathshaverecentlybeenshowntobelongtoacom- i,ls valuesinordertoimposethemarkers. mon energy minimization framework [3]. The considered problemistofindalabelingx∗ ∈R|V|definedonthenodes 2.5. Applications to optical flow and semantic seg- ofagraphthatminimizes mentation An optical flow map may be easily estimated from two E(x)= (cid:88) wipj|xj −xi|q+ (cid:88) wip|li−xi|q, (3) successivesegmentationsStandSt+1. Foreachregionrof S ,ifthelabelofrcomesfromalabelpresentinaregion eij∈E vi∈V t+1 softhesegmentationS ,theopticalflowinr iscomputed t where l represents a given configuration and x asthedistancebetweenthecentroidofrandthecentroidof represents the target configuration. The result of s. Theopticalflowmapmaybeusedasasanitycheckfor lim argmin E(x) for values of q ≥ 1 always pro- regiontrackingapplications.Byprinciple,avideosequence p→∞ x ducesacutbymaximum(equivalentlyminimum)spanning willnotcontaindisplacementsofobjectsgreaterthanacer- 3 Figure3.Comparisonwiththemean-shiftsegmentationmethodofParis[15]onFrame19and20.k=200,δ=400,σ=0.5. tainvalue. largevariationsintheregionsizesandlargemovementsof For each superpixel s of S , if the label of region s cameraisillustratedonFigure2. t+1 comesfromtheprevioussegmentationSt,thenthesemantic A comparison with the temporal mean shift segmenta- prediction from St is propagated to St+1. Otherwise, in tion of Paris [15] is displayed at Figure 3. The super- casethelabelofsisanewlabel,thesemanticpredictionis pixels produced by the [15] are not spatially consistent as computedusingthepredictionattimet+1. Assomeerrors the segmentation is performed in the feature (color) space mayappearintheregionstracking,thelabelsofregionsthat intheircase. Ourapproachisslower,althoughqualifiedfor haveinconsistentlargevaluesinopticalflowmapsarenot real-timeapplicationbutcomputesonlyspatiallyconsistent propagated. super-pixels. For the specific task of semantic segmentation, results can be improved by exploiting the contours of the recog- 3.2.Semanticscenelabeling nizedobjects. Semanticcontourssuchasforexampletran- sitionbetweenabuildingandatreeforinstance,mightnot Wesupposethatwearegivenanoisysemanticlabeling bepresentinthegradientoftherawimage. Thus,inaddi- for each frame. In this work we used the semantic predic- tiontothepairwiseweightsωdescribedinSection2.1, we tionsof[5]. addaconstantinthepresenceofasemanticcontour. We compare our results with the results of [13] on the NYU-Scene dataset. The dataset consists in a video se- 3.Results quence of 73 frames provided with a dense semantic la- Wenowdemonstratetheefficiencyandversatilityofour belingandgroundtruths. Theprovideddenselabelingbe- approachbyapplyingittodifferentproblems:simplesuper- ingperformedwithnotemporalexploitation,itsuffersfrom pixel segmentation, semantic scene labeling, and optical suddenlargeobjectappearancesanddisappearances. Asil- flow. lustrated in Figure 4 our approach reduces this effect, and Followingtheimplementationof[6],wepre-processthe improves the classification performance of more than 5% imagesusingaGaussianfilteringstepwithakernelofvari- asreportedinTable3.2. Wealsodisplayresultsonanother ance σ is employed. A post-processing step that removes videoinFigure(5). regions of small size, that is to say below a threshold δ is alsoperformed. Asin[6], wedenotethescaleofobserva- 3.3.Opticalflow tionparameterbyk. AsdetailedinSection2.5, wecomputetheopticalflow 3.1.Super-pixelsegmentation maps between subsequent frames of videos. An example Experiments are performed on two different types of of result is shown in Figure 5, that illustrates the accurate videos:videoswherethecameraisstatic,andvideoswhere detectionofalargemoveofthecamerafromtheFrames9 the camera is moving. The robustness of our approach to to10ofavideotakenonahighway. 4 (a)Independentsegmentationswithnotemporalsmoothing (b)Resultusingthetemporalsmoothingmethodof[13] (c)Ourtemporallyconsistentsegmentation Legend: balcony car person sidewalk tree awning building door road sun window Figure4.Comparisonwiththetemporalsmoothingmethodof[13].Parametersused:k=1200,δ=100,σ=1.2. Frame Miksik Our instance on embedded devices where a GPU is often not byframe etal.[13] method available. Accuracy 71.11 75.31 76.27 #Frames/sec 1.33∗ 10.5 4.Conclusion Table1.Overallpixelaccuracy(%)forthesemanticsegmentation taskontheNYUScenevideo.∗Notethatthereportedtimingdoes Theproposedapproachemploysagraphmatchingtech- nottakeintoaccounttheopticalflowcomputationneededby[13]. niquetoproducemarkersusedinaglobaloptimizationpro- cedureforvideosegmentation.Unlikemanyvideosegmen- tation techniques, our algorithm is causal – which is a re- 3.4.Computationtime quired property for real-time applications – and does not require any computation of optical flow. Our experiments The experiments were performed on a laptop with 2.3 on challenging videos show that the obtained super-pixels GHz intel core i7-3615QM, Memory 8Go DDR3 1600 are robust to large camera or objects displacement. Their MHz. OurmethodisimplementedonCPUonly,inC/C++, useinsemanticsegmentationapplicationsdemonstratethat and makes use of only one core of the processor. Super- significant gains can be achieved and lead to state-of-the- pixel segmentations take 0.1 seconds per image of size art results. Furthermore, by being 8 times faster than the 320×240 and 0.4 seconds per image of size 640×380, competingmethodfortemporalsmoothingofsemanticseg- thusdemonstratingthescalabilityofthepipeline. Allcom- mentation,andupto25timesfasteriftheuseofGPUisnot putationsareincludedinthereportedtimings. available, the proposed approach has by itself a practical ThetimingsofthetemporalsmoothingmethodofMik- interest. sik et al.[13] are reported in Table 3.2. We note that the processor used for the reported timings of [13] has similar characteristicsasours.Furthermore,Mistiketal.useanop- Acknowledgments ticalflowprocedurethattakesonly0.02secondsperframe whenimplementedonGPU,buttakessecondsonCPU.Our The authors would like to thank Laurent Najman for approachisthusmoreadaptedtorealtimeapplicationsfor fruitfuldiscussions. 5 (a) Frame9 (b) Frame10 (c) SegmentationS9 (d) SegmentationS10 (e) Opticalflow (f) Legend (g) Framebyframesegm. (h) Oursemanticsegm. Figure5.Bymatchingregions,ourmethodcanbeusedtoderiveanopticalflow(e).Parametersused:k=200,δ=400,σ=0.81 References [9] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Effi- cienthierarchicalgraph-basedvideosegmentation. InProc. [1] C.Alle`ne,J.-Y.Audibert,M.Couprie,andR.Keriven.Some of IEEE Computer Vision and Pattern Recognition (CVPR links between extremum spanning forests, watersheds and 2010),2010. 1 min-cuts. ImageandVisionComputing,28(10):1460–1471, [10] D.Hoiem,A.Efros,andM.Hebert.Geometriccontextfrom 2010. 2 asingleimage. InIEEEInternationalConferenceonCom- [2] D.Comaniciu,P.Meer,andS.Member.Meanshift:Arobust puterVision(ICCV2005),volume1,pages654–661Vol.1, approachtowardfeaturespaceanalysis. IEEETransactions oct.2005. 1 onPatternAnalysisandMachineIntelligence,24:603–619, [11] A.Joulin,F.Bach,andJ.Ponce.Multi-classcosegmentation. 2002. 1 InProc.ofIEEEComputerVisionandPatternRecognition [3] C.Couprie, L.J.Grady, L.Najman, andH.Talbot. Power (CVPR2012),Providence,RI,USA,pages542–549,2012.1 watershed:Aunifyinggraph-basedoptimizationframework. [12] J. Lee, J.-H. Oh, and S. Hwang. Clustering of video ob- IEEETrans.onPatternAnalysisandMachineIntelligence, jectsbygraphmatching. InProc.oftheIEEEInternational 33(7):1384–1399,2011. 2,3 ConferenceonMultimediaandExpo,ICME2005,July6-9, [4] J.Cousty,G.Bertrand,L.Najman,andM.Couprie. Water- 2005,Amsterdam,TheNetherlands,pages394–397,2005.1 shedcuts: Minimumspanningforestsandthedropofwater [13] O.Miksik,D.Munoz,J.A.D.Bagnell,andM.Hebert.Effi- principle. IEEETrans.PatternAnalysisandMachineIntel- cienttemporalconsistencyforstreamingvideosceneanaly- ligence,31(8):1362–1374,2009. 2 sis.TechnicalReportCMU-RI-TR-12-30,RoboticsInstitute, [5] C.Farabet,C.Couprie,L.Najman,andY.LeCun. Learning Pittsburgh,PA,September2012. 1,4,5 hierarchicalfeaturesforscenelabeling. IEEETrans.onPat- [14] P. K. Nathan Silberman, Derek Hoiem and R. Fergus. In- ternAnalysisandMachineIntelligence, 2013. inpress. 1, doorsegmentationandsupportinferencefromrgbdimages. 4 InProc.ofIEEEEuropeanConferenceonComputerVision [6] P.F.FelzenszwalbandD.P.Huttenlocher. Efficientgraph- (ECCV2012),2012. 1 based image segmentation. International Journal of Com- [15] S. Paris. Edge-preserving smoothing and mean-shift seg- puterVision,59(2):167–181,2004. 1,2,4 mentation of video streams. In Proc. of IEEE European [7] D. Glasner, S. N. Vitaladevuni, and R. Basri. Contour- Conference on Computer Vision (ECCV 2008), Marseille, based joint clustering of multiple segmentations. In Proc. France,pages460–473,2008. 1,4 of IEEE Computer Vision and Pattern Recognition (CVPR [16] J.ShiandJ.Malik. Normalizedcutsandimagesegmenta- 2011),pages2385–2392,Washington,DC,USA,2011. 1 tion. IEEE Transactions on Pattern Analysis and Machine [8] C. Gomila and F. Meyer. Graph-based object tracking. In Intelligence,22:888–905,1997. 1 Proc.ofIEEEInternationalConferenceonImageProcess- ing(ICIP2003),volume2,pagesII–41–4vol.3,sept.2003. 1 6

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.