Deep Multitask Architecture for Integrated 2D and 3D Human Sensing

Alin-Ionut Popa²*, Mihai Zanfir²*, Cristian Sminchisescu¹,²
¹Department of Mathematics, Faculty of Engineering, Lund University
²Institute of Mathematics of the Romanian Academy
*Authors contributed equally

arXiv:1701.08985v1 [cs.CV] 31 Jan 2017

Abstract

We propose a deep multitask architecture for fully automatic 2d and 3d human sensing (DMHS), including recognition and reconstruction, in monocular images. The system computes the figure-ground segmentation, semantically identifies the human body parts at pixel level, and estimates the 2d and 3d pose of the person. The model supports the joint training of all components by means of multi-task losses where early processing stages recursively feed into advanced ones for increasingly complex calculations, accuracy and robustness. The design allows us to tie a complete training protocol, by taking advantage of multiple datasets that would otherwise restrictively cover only some of the model components: complex 2d image data with no body part labeling and without associated 3d ground truth, or complex 3d data with limited 2d background variability. In detailed experiments based on several challenging 2d and 3d datasets (LSP, HumanEva, Human3.6M), we evaluate the sub-structures of the model, the effect of various types of training data in the multitask loss, and demonstrate that state-of-the-art results can be achieved at all processing levels. We also show that in the wild our monocular RGB architecture is perceptually competitive to a state-of-the-art (commercial) Kinect system based on RGB-D data.

1. Introduction

The visual analysis of humans has a broad spectrum of applications as diverse as autonomous vehicles, robotics, human-computer interaction, virtual reality, and digital libraries, among others. The problem is challenging due to the large variety of human poses and body proportions, occlusion, and the diversity of scenes, angles of observation, and backgrounds humans are pictured against. The monocular case, which is central and intrinsic in many scenarios like the analysis of photographs or video available on the web, adds complexity as depth information is missing for 3d reconstruction. This leads to geometric ambiguity and occlusion, which are difficult to palliate compared to situations where multiple cameras are present.

A detailed analysis at both 2d and 3d levels further exposes the need for both measurement and prior knowledge, and the necessary interplay between segmentation, reconstruction, and recognition within models that can jointly perform all tasks.

As training is essential, a major difficulty is the limited coverage of current datasets: 2d repositories like LSP [18] or MPI-II [3] exhibit challenging backgrounds, human body proportions, clothing, and poses, but offer single views, carry only approximate 2d joint location ground truth, and do not carry human segmentation or body part labeling information. Their size is also relatively small by today's deep learning standards. In contrast, 3d datasets like HumanEva [38] or Human3.6M [16] offer extremely accurate 2d and 3d anatomical joint or body surface reconstructions and a variety of poses captured under multiple viewpoints. Human3.6M is also large-scale. However, being captured indoors, the 3d datasets typically lack the background and clothing variability that represent a strength of the 2d datasets captured in the wild.¹ An open question is how one can leverage the separate strengths of existing 2d and 3d datasets towards training a model that can operate in realistic images and can offer accurate recognition and reconstruction estimates.

In this paper we propose one such deep learning model which, given a monocular RGB image, is able to fully automatically sense the humans at multiple levels of detail: figure-ground segmentation, body-part labeling at pixel level, as well as 2d and 3d pose estimation. By designing multi-task loss functions at different, recursive processing stages (human body joint detection and 2d pose estimation, semantic body part segmentation, 3d reconstruction), we are able to tie complete, realistic training scenarios by taking advantage of multiple datasets that would otherwise restrictively cover only some of the model components (complex 2d image data with no body part labeling and without associated 3d ground truth, or complex 3d data with limited 2d background variability), which would lead to covariate shift and a lack of model expressiveness. In extensive experiments, including ablation studies performed using representative 2d and 3d datasets like LSP, HumanEva, or Human3.6M, we illustrate the model and show that state-of-the-art results can be achieved for both semantic body part segmentation and 3d pose estimation.

¹The situation is slightly more nuanced, as some of the 3d datasets (e.g. Human3.6M) come with mixed-reality training setups where a moderately realistic graphics character was placed, in a geometrically correct setup, into a real scene and animated using human motion capture; arguably, though, a complete, fully realistic 2d and 3d training setting is still elusive.
2. Related Work

This work relates to 2d and 3d monocular human pose estimation methods, as well as semantic segmentation using fully-trainable deep processing architectures. As prior work is comprehensive in each sub-domain [35], we will here mostly cover some of the key techniques directly relating to our approach, with an emphasis towards deep architectures and methodologies aiming to integrate the different levels of 2d and 3d processing.

The problem of 2d human pose estimation was initially approached using pictorial structures and deformable part models, where the kinematic tree structure of the human body offers a natural decomposition [11, 19, 7, 47, 29, 12]. In recent years, the deep learning methodology has had a great impact on state-of-the-art pose estimation models, where different hierarchical feature extraction architectures have been combined with spatial constraints between the human body parts [44, 10, 43, 20, 30, 28, 23, 8, 13, 9, 41, 27, 24]. Recent deep architectures are obtained by cascading processing stages with similar structure but different parameters, where the system combines outputs of early layers with new information extracted directly from the image using learnt feature extractors. Such recursive schemes for 2d pose estimation appear in the work of [32, 46, 9], whereas a similar idea of feeding back 3d estimates into 2d inference layers appears in [15]. Note that the iterative processing frameworks of [32, 15] are not deep, being built on parts-based graphical models and random forests, respectively.

There is a vast literature on monocular 3d human pose estimation, including the analysis of kinematic ambiguities associated with the 3d pose [40, 5], as well as generative and discriminative methods for learning and inference [34, 36, 4, 1, 26, 37, 14, 30]. More recently, deep convolutional architectures have been employed in order to estimate 3d pose directly from images [21, 22, 33, 49], mostly in connection with 3d human motion capture datasets like HumanEva and Human3.6M, where the poses are challenging but the backgrounds are relatively simple. There is also interest in combining 2d and 3d estimation methods in order to obtain models capable of multiple tasks, and able to operate in realistic imaging conditions [31, 45, 2]. The most recent methods, [6, 33], rely on an a-priori 3d model that is fitted to anatomical body joint data obtained from an initial 2d pose estimate produced by a deep processing architecture, like the one of [13] or [43]. The methods rely on state-of-the-art discriminative person detection and 2d pose estimation, and introduce a generative fitting component to search in a space of admissible body proportion variations. Both approaches fit a statistical body shape and kinematic model to data, using one or multiple views and the joint assignment constraints from 2d human pose estimation. These methods use the 3d to 2d anatomical landmark assignment provided by [6, 33] as initialization for 3d human pose estimation. While this is effective, as shown in several challenging evaluation scenarios, the initial 3d shape and kinematic configuration still needs to be initialized manually or set to a key initial pose. This is in principle still prone to local optima, as the monocular 3d human pose estimation cost is non-linear and non-convex even under perfect 3d to 2d model-image assignments [40].

We share with [15, 6, 33] the interest in building models that integrate 2d and 3d reasoning. We propose a fully trainable discriminative model for human recognition and reconstruction at 2d and 3d levels. We do not estimate human body shape, but we do estimate figure-ground segmentation, the semantic segmentation of the human body parts, as well as the 2d and 3d pose. The system is trainable end-to-end, by means of multitask losses that can leverage the complementary properties of existing 2d and 3d human datasets. The model is fully automatic in the sense that the human detection, the body part segmentation, and the 2d and 3d estimates are all the result of recurrent stages of processing in a homogeneous, easy to understand, and computationally efficient architecture. Our approach is complementary to [6, 33]: our model can benefit from a final optimization-based refinement, and it would be useful to estimate the human body shape. In contrast, [6, 33] can benefit from the semantic segmentation of the human body parts for their shape fitting, and could use the accurate, fully automatic 2d and 3d pose estimates we produce as initialization for their 3d to 2d refinement process.
3. Methodology

In this section we present our multitask multistage architecture. The idea of using multiple stages of recurrent feed-forward processing is inspired by architectures like [32, 46, 9], which focus separately on the 2d and 3d domains. However, we propose a uniform architecture for joint 2d and 3d processing that no prior method covers. Our choice of multi-task loss also makes it possible to exploit the complementary advantages of different datasets.

Conceptually, each stage of processing in our model produces recognition and reconstruction estimates and is constrained by specific training losses. Specifically, each stage t is split into semantic processing, S^t, and 3d reconstruction, R^t (see fig. 1). At the same time, the semantic module S^t is divided in two sub-tasks, one focusing on 2d pose estimation, J^t, and the other on body part labeling and figure-ground segmentation, B^t (see fig. 2). The first one (i.e. J^t) feeds into the second one (i.e. B^t), while the semantic stages feed into the reconstruction stages. Each task consists of a total of six recurrent stages, which take as input the image, the results of the previous stages of the same type (except for the first stage), as well as inputs from other stages (2d pose estimation feeding into semantic body part segmentation, and both feeding into 3d pose reconstruction). The inputs to each stage are individually processed and fused via convolutional networks in order to produce the corresponding outputs.

Figure 1. Stage t of our architecture for recognition and reconstruction: figure-ground segmentation, 2d pose estimation, and semantic segmentation of body parts, all denoted by (S), and 3d reconstruction (R). The semantic task is detailed in fig. 2 and fig. 3; the 3d reconstruction task is detailed in fig. 4.
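To make the stage wiring concrete, the following is a minimal sketch of the dataflow described above, with T = 6 recurrent stages per sub-task. All names here (dmhs_forward, features, joint_stage, label_stage, recon_stage) are hypothetical placeholders standing in for the convolutional blocks of the paper, not the authors' implementation.

```python
import numpy as np

T = 6  # number of recurrent stages per sub-task

def dmhs_forward(image, features, joint_stage, label_stage, recon_stage):
    """Run the multitask pipeline: J^t -> B^t -> R^t at every stage t.

    Each *_stage argument is an opaque callable standing in for a
    convolutional block; losses would attach to every stage output.
    """
    x = features(image)                  # shared image features
    J = B = R = None                     # no previous-stage estimates at t = 1
    outputs = []
    for t in range(1, T + 1):
        J = joint_stage(t, x, J)         # 2d joint belief maps (uses J^{t-1})
        B = label_stage(t, x, J, B)      # body part labels (uses J^t, B^{t-1})
        R = recon_stage(t, x, J, B, R)   # 3d reconstruction (uses S^t, R^{t-1})
        outputs.append((J, B, R))
    return outputs

# Toy usage with dummy stand-ins for the real network modules:
outs = dmhs_forward(np.zeros((4, 4, 3)),
                    features=lambda img: img,
                    joint_stage=lambda t, x, J: x.mean(),
                    label_stage=lambda t, x, J, B: J,
                    recon_stage=lambda t, x, J, B, R: (J, B))
```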
3.1. 2D Human Body Joint Detection

The 2d pose estimation task is based on a recurrent convolutional architecture similar to [46]. Given an RGB image I ∈ R^{w×h×3}, we seek to correctly predict the locations of N_J anatomically defined human body joints p_k ∈ Z ⊂ R², with k ∈ {1...N_J}. At each stage t ∈ {1...T}, where T is the total number of stages, the network outputs belief maps J^t ∈ R^{w×h×N_J}. The first stage of processing operates only on image evidence (a set of seven convolution and three pooling layers producing features x), but for subsequent stages the network also considers the information in the belief maps fed from the previous stage, J^{t-1}, with a slightly different image feature function x′, defined as a set of four convolution and three pooling layers, as in [46]. These features are transformed through a classification function c_J^t to predict the body joint belief maps J^t. The function c_J^t consists of a series of five convolutional layers, the first three of the form (11×11×128), followed by a (1×1×128) convolution and a final (1×1×N_J) convolution that outputs J^t. The loss function at each stage, L_J^t, minimizes the squared Euclidean distance between the predicted and ground truth belief maps, J^t and J^⋆:

L_J^t = \sum_{k=1}^{N_J} \sum_{z \in Z} \left\lVert J^t(z,k) - J^\star(z,k) \right\rVert_2^2   (1)

In practice, this component of the model can be trained with data from both 2d and 3d datasets: LSP, where the ground truth is obtained manually, or HumanEva and Human3.6M, where the ground truth is obtained automatically based on anatomical markers.
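A minimal NumPy sketch of the per-stage belief map loss in eq. (1); the function name and the 46×46 grid with 14 joints used in the example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def joint_detection_loss(J_pred, J_gt):
    """Squared Euclidean loss between predicted and ground truth belief
    maps (eq. 1). J_pred, J_gt: arrays of shape (w, h, N_J); the double
    sum over locations z and joint channels k reduces to a sum of squares."""
    return np.sum((J_pred - J_gt) ** 2)

# Example with a hypothetical 46x46 grid and 14 joints:
J_pred = np.random.rand(46, 46, 14)
J_gt = np.zeros((46, 46, 14))
print(joint_detection_loss(J_pred, J_gt))
```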
3.2. Semantic Body Part Segmentation

In semantic body part segmentation (body part labeling) we assign to each image location (u,v) ∈ Z ⊂ R² one of N_B anatomical body part labels (including an additional label for background), b_l, where l ∈ {1...N_B}. At each stage t, the network predicts, for each pixel location, the probability of each body part being present, B^t ∈ R^{w×h×N_B}. Differently from the previous task, our aim now is to classify each pixel location, not only to identify the body joints. The loss function therefore changes from a squared Euclidean loss to a multinomial logistic loss:

L_B^t = -\frac{1}{|Z|} \sum_{z \in Z} \log\big(B^t_{z, B_z^\star}\big)   (2)

where B_z^⋆ is the ground truth label for each image location z = (u,v).

During the first stage of processing, we use convolutional representations based on the image (a series of convolution and pooling layers x, with parameters tied from §3.1) and the 2d pose belief maps J¹ in order to predict the initial body labels B¹. For each of the following stages, we also use the information present in the body labels at the previous stage, B^{t-1}, and rely on a series of four convolutional layers c_B^t that learn to combine inputs obtained by stacking the image features x and B^{t-1}. The function c_B^t shares the same structure as the first four convolutions in c_J^t, but a classifier in the form of a (1×1×N_B) convolution is applied after the fusion with the current 2d pose belief maps J^t, in order to obtain the semantic probability maps B^t. An overview of our architecture, together with the main dependencies, is given in figure 3. Finally, we use an additional deconvolution layer [25] of size 16×16×N_B, such that the loss can be computed at the full resolution of the input image I.

In practice, realistic data for training this component of the loss is not as easy to obtain as 2d body joint positions. Human3.6M offers such training data, but we are also able to generate it approximately for LSP (see §4).

Figure 2. Stage t of our semantic task, including 2d joint detection (J) and labeling of the body parts (B).
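Eq. (2) is a per-pixel multinomial logistic (cross-entropy) loss averaged over image locations. A small NumPy sketch follows, assuming the network output has already been normalized to per-pixel class probabilities; names and sizes are illustrative.

```python
import numpy as np

def body_label_loss(B_pred, labels):
    """Multinomial logistic loss (eq. 2): average negative log-probability
    of the ground truth label at each pixel. B_pred: (w, h, N_B) per-pixel
    class probabilities; labels: (w, h) integer ground truth label map."""
    w, h, _ = B_pred.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h), indexing="ij")
    p_true = B_pred[u, v, labels]            # probability of the correct class
    return -np.mean(np.log(p_true + 1e-12))  # small constant for stability

# Example: 25 classes (24 body parts + background), uniform predictions.
B_pred = np.full((46, 46, 25), 1.0 / 25)
labels = np.zeros((46, 46), dtype=int)       # everything labeled background
print(body_label_loss(B_pred, labels))       # approximately log(25)
```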
3.3. 3D Pose Reconstruction

This model component is designed for the stage-wise, recurrent reconstruction of 3d human body configurations, represented as a set of N_R 3d skeleton joints, from a single monocular image I. The estimate is obtained from the internal representations R^t. The 3d reconstruction module leverages information provided by the 2d semantic components S^t, incorporating the joint and body part labeling feature maps J^t and B^t. Additionally, we insert a trainable function c_D^t, defined similarly to c_B^t, over image features, in order to obtain body reconstruction feature maps D^t. The module follows a similar flow as the previous ones: it reuses estimates at earlier processing stages, R^{t-1}, together with S^t and D^t, in order to predict the reconstruction feature maps R^t. The processing stages and dependencies of this module are shown in fig. 4.

Procedurally, we first fuse S^t and D^t, then apply a series of one (3×3×128) convolution, one (3×3×64) convolution, and one (1×1×64) convolution, followed by a pooling layer (3×3) and a (1×1×16) convolution. The output is concatenated with R^{t-1} and convolved by a (1×1×16) kernel that learns to combine the two components, producing the estimate R^t. The feature maps are then transformed to the desired dimensionality of the 3d human body skeleton by means of a fully connected layer. The loss L_R^t is expressed as the mean per joint position error (MPJPE):

L_R^t = \sum_{i=1}^{N_R} \sqrt{\sum_{j=1}^{3} \big(f(R^t, i, j) - R^\star(i, j)\big)^2 + \epsilon^2}   (3)

where R^⋆ are the 3d ground truth human joint positions, f(·) is the fully connected layer applied to R^t, and ε is a small constant that makes the loss differentiable.

This loss component can be trained with data from HumanEva and Human3.6M, but not from LSP or other 2d datasets, as these lack 3d ground truth information. Although the backgrounds in HumanEva and Human3.6M are not as challenging as those in LSP, the use of a multitask loss makes the complete 2d and 3d model competitive not only in a laboratory setting but also in the wild (see §4 and fig. 5).
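A NumPy sketch of the smoothed per-joint position loss in eq. (3); dividing the sum by N_R gives the mean error reported in the experiments. The joint count and noise level in the example are illustrative assumptions.

```python
import numpy as np

def mpjpe_loss(R_pred, R_gt, eps=1e-8):
    """Smoothed per-joint position error (eq. 3). R_pred, R_gt: (N_R, 3)
    arrays of 3d joint positions (R_pred plays the role of f applied to
    the reconstruction features). The eps term keeps the square root
    differentiable at zero error."""
    sq = np.sum((R_pred - R_gt) ** 2, axis=1)  # per-joint squared distance
    return np.sum(np.sqrt(sq + eps ** 2))      # summed over the N_R joints

# Example with 17 joints and roughly 10 mm of noise per coordinate:
R_gt = np.random.rand(17, 3) * 1000.0          # hypothetical positions, in mm
R_pred = R_gt + np.random.randn(17, 3) * 10.0
print(mpjpe_loss(R_pred, R_gt))
```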
3.4. Integrated Multi-task Multi-stage Loss

Given the information provided in the previous sections, we are now able to define the complete multitask, multi-stage loss function of the model as follows:

L = \sum_{t=1}^{T} \big(L_J^t + L_B^t + L_R^t\big)   (4)

The loss allows us to conveniently train all the model component parameters, for different tasks, based on datasets where the annotations are challenging, or where annotations are missing entirely, as datasets with different coverage contribute to different components of the loss. Whenever we train using datasets with partial coverage, we can freeze the model components for which we do not have ground truth. We can also simultaneously train all parameters, using datasets with partial and complete coverage: those examples for which we have ground truth at all levels will contribute to each of the loss components, whereas examples for which we have only partial ground truth will contribute only to their corresponding losses.
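One way to realize the partial-coverage training described above is to mask each task loss by the availability of its ground truth, so that a 2d-only example (e.g. from LSP) contributes nothing to the 3d term. A minimal sketch; the masking scheme is our illustration of the mechanism, not the authors' published code.

```python
def multitask_loss(stage_losses, has_2d, has_parts, has_3d):
    """Combined multi-task, multi-stage loss (eq. 4), with per-example
    ground truth indicators so that datasets with partial coverage only
    contribute to the losses they can supervise.

    stage_losses: list of (L_J, L_B, L_R) tuples, one per stage t."""
    total = 0.0
    for L_J, L_B, L_R in stage_losses:
        total += (has_2d * L_J) + (has_parts * L_B) + (has_3d * L_R)
    return total

# An LSP example supervises J and (approximate) B, but not R:
losses = [(1.2, 0.8, 3.5)] * 6           # hypothetical per-stage values, T = 6
print(multitask_loss(losses, has_2d=1.0, has_parts=1.0, has_3d=0.0))
```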
4. Experiments

In order to evaluate our method, we use three well-known datasets: the Leeds Sports Pose dataset (LSP) [18], HumanEva [38], and Human3.6M [16].

The LSP dataset consists of 2d pose annotated RGB images depicting sports people (athletics, badminton, baseball, gymnastics, parkour, soccer, tennis, volleyball). We employ both the original release, containing 1,000 training and 1,000 testing images, and the extended training release, containing an additional 10,000 images.

We use the first release of the HumanEva(-I) dataset. This dataset was captured by an accurate 3d motion capture system in a laboratory environment. There are six actions, performed in total by three different subjects. As is standard procedure [39, 4, 48, 42], we train our model on the train set and report results on the validation set, where we only consider every 5th frame of the sequences walking, jog and box, for all three subjects and a single frontal camera view.

Human80K is an 80,000-sample subset of the much larger 3.6-million-sample human pose dataset Human3.6M [16]. The dataset is captured in a laboratory environment with a motion capture setup, and contains daily activities and interaction scenarios (providing directions, discussion, eating, activities while seated, greeting, taking photos, posing, making purchases, smoking, waiting, walking, sitting on a chair, talking on the phone, walking a dog, walking together). The actions are performed by 11 actors and captured by 4 RGB cameras. The dataset is fully annotated: it contains RGB data, 2d body part labeling ground truth masks, as well as accurate 2d and 3d pose reconstructions. Human80K consists of 55,144 training and 24,416 testing samples from Human3.6M. The samples from each original capture were selected such that the distance between any two of them, in 3d space, is no more than 100 mm.

We use the Caffe [17] framework to train and test our architecture. The complete recognition and reconstruction pipeline takes approximately 400 ms per frame, in testing, on an Nvidia TITAN X (Pascal) 12GB GPU. We evaluate the recognition (2d body part labeling) and 3d reconstruction capabilities of our architecture. We use T = 6 stages for each sub-task component model (joint detection, semantic segmentation, 3d reconstruction) and report results only for the final stage of each sub-task, as it is the best performing according to validation.

Figure 3. Our multitask multistage 2d semantic module S^t combines semantic body part labeling B^t and 2d pose estimation J^t.

Figure 4. Our multitask multistage 3d reconstruction module R^t combines 3d processing with information from the semantic modules S^t.

4.1. Body Part Labeling

In order to evaluate the 2d body part labeling task, we use the Human80K and LSP datasets. We introduce additional annotations for LSP, as they are not available with the original release of the dataset, which only provides the 2d anatomical joints. We create human body part annotations for LSP by using the annotated 2d skeleton joints and the kinematic tree. We produce circles for skeleton joints and ellipses for the individual body parts. We set the major axis to the size of the segment between the corresponding joints, and estimate a constant value, for each body part, for the minor axis. The resulting masks (filtered by visual inspection) are of lower quality than those available for Human3.6M. The reason for augmenting LSP is to enhance the variability in human appearance, body proportions, and backgrounds beyond what Human80K offers. Thus, we want to leverage the good quality labeling and pose variation of Human80K together with the diverse appearance variations found in LSP.
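The annotation procedure above can be sketched in a few lines with OpenCV: filled circles at the joints and, for each limb of the kinematic tree, a filled ellipse whose major axis spans the joint-to-joint segment. All parameter values here (radius, minor axes, canvas size) are illustrative, not the authors' settings.

```python
import numpy as np
import cv2  # OpenCV, used only to rasterize circles and ellipses

def render_part_mask(joints, limbs, minor_axes, shape, joint_radius=6):
    """Approximate body part label masks from 2d skeleton joints, in the
    spirit of the LSP annotation procedure described above. joints: (N, 2)
    array; limbs: list of (joint_a, joint_b) index pairs forming the
    kinematic tree; minor_axes: one fixed minor half-axis per body part."""
    mask = np.zeros(shape, dtype=np.uint8)   # label 0 = background
    for label, (a, b) in enumerate(limbs, start=1):
        pa, pb = joints[a].astype(float), joints[b].astype(float)
        cx, cy = (pa + pb) / 2.0
        major = max(int(np.linalg.norm(pb - pa) / 2), 1)
        angle = float(np.degrees(np.arctan2(pb[1] - pa[1], pb[0] - pa[0])))
        cv2.ellipse(mask, (int(cx), int(cy)),
                    (major, int(minor_axes[label - 1])),
                    angle, 0, 360, int(label), thickness=-1)
        for p in (pa, pb):                   # circles at the two joints
            cv2.circle(mask, (int(p[0]), int(p[1])), joint_radius,
                       int(label), thickness=-1)
    return mask

# Toy example: one "upper arm" between two joints on a 100x100 canvas.
joints = np.array([[30.0, 40.0], [70.0, 60.0]])
mask = render_part_mask(joints, limbs=[(0, 1)], minor_axes=[8],
                        shape=(100, 100))
```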
For evaluation on Human80K, we compare with the results of [15], which represent the state of the art for this task on this dataset. The authors of [15] assume that the silhouette of the person (the figure-ground segmentation) is given, and perform body part labeling only on the foreground mask, as an inference problem over a total of 24 unique labels. Different from them, we do not make this assumption, and consider the background as an extra class, thus building a model that predicts 25 classes.

To extend the evaluation to LSP, we consider multiple scenarios: (a) training on Human80K, (b) training on Human80K with fine-tuning on LSP, and (c) training our architecture on LSP and Human80K simultaneously; we test on both Human80K and LSP. In our setup, training using only LSP was not feasible, as in multiple experiments our network failed to converge. The labeling parameters (stages B) are initialized randomly, while the parameters corresponding to the 2d joint detection components, J, are initialized with the values of the network presented in [46], trained on MPI-II and LSP.

The performance of our body part labeling models on Human80K and LSP is given in tables 1 and 2, respectively. We use the same evaluation metrics as [15], i.e. average accuracy for the pixels contained within the ground truth silhouette, and class-normalized average accuracy. These two metrics apply only to the foreground classes. Additionally, we compute, over all pixels, background and foreground, the average accuracy as well as the class-normalized average accuracy.

| Metric                        | DMHS_B - Human80K | DMHS_B - LSP (ft) | DMHS_B - Human80K & LSP | [15]  |
|-------------------------------|-------------------|-------------------|-------------------------|-------|
| Avg. Acc. (%) per pixel (fg)  | 79.00             | 53.31             | 75.84                   | 73.99 |
| Avg. Acc. (%) per pixel (fg+bg) | 91.15           | 83.30             | 89.92                   | -     |
| Avg. Acc. (%) per class (fg)  | 67.35             | 43.40             | 64.83                   | -     |
| Avg. Acc. (%) per class (fg+bg) | 68.56           | 45.61             | 66.13                   | 53.10 |

Table 1. Body part labeling results for the Human80K test set. We report the performance of our network trained on Human80K, on LSP, and on both datasets jointly. Note that for LSP the network was pre-trained on Human80K, as it otherwise failed to converge when trained on LSP alone. We also compare with [15], where comparisons are possible only for accuracies computed on the person foreground, as the model of [15] does not predict the background label. Our model is able to predict the background class, thus we report the performance for the entire image (including the background class) as well as the performance inside the silhouette (for classes associated with human body parts). Note that the best performance on Human80K is obtained with the network trained on Human80K, perhaps due to the noisy annotations added to LSP and the naturally more similar training and testing distributions of Human80K.

From table 1 (Human80K testing), it can be noted that even though we solve a harder problem (by additionally estimating the figure-ground segmentation), the class-normalized average precision greatly improves, by more than 10% over [15], for the models trained on (a) Human80K and (c) Human80K and LSP jointly. However, the model (b), initialized with parameters trained on Human80K but fine-tuned on LSP, shows a performance drop caused by the low quality of the LSP body label annotations. In this scenario, as expected, the best performance is obtained by the model trained on Human80K, perhaps because the test set distribution is better captured by the unaltered Human80K training set. This is not the case when testing on LSP, as can be seen from table 2: there, the best performance is obtained by the model trained jointly on Human80K and LSP. This model is able to combine the pose variability and part labeling annotation quality of Human80K with the background and appearance variability of LSP, making it adequate both for laboratory settings and for images of people captured against challenging backgrounds.

| Metric                        | DMHS_B - Human80K | DMHS_B - LSP (ft) | DMHS_B - Human80K & LSP |
|-------------------------------|-------------------|-------------------|-------------------------|
| Avg. Acc. (%) per pixel (fg)  | 50.52             | 60.54             | 61.16                   |
| Avg. Acc. (%) per pixel (fg+bg) | 85.58           | 91.08             | 91.09                   |
| Avg. Acc. (%) per class (fg)  | 36.46             | 44.73             | 45.91                   |
| Avg. Acc. (%) per class (fg+bg) | 38.77           | 46.88             | 48.01                   |

Table 2. Body part labeling results for the LSP dataset. All models are initialized on the Human80K dataset, as networks trained only on LSP failed to converge. In this case the models trained jointly on Human80K and LSP produced the best results. This shows the importance of having accurate body part labeling annotations, obtained in simple imaging scenarios but for complex body poses, in combination with less accurate annotations exhibiting more complex foreground-background appearance variations.

The proposed models for the 2d body part labeling task are trained using an initial learning rate of 10^-10, reduced every 5 epochs by a constant factor γ = 0.33. During learning, the training data is augmented by random rotations with angles in [-40°, +40°], scaling by a factor in the range [0.5, 1.2], and horizontal flipping, in order to improve the diversity of the training set. Qualitative results of the semantic body part segmentation in challenging images are shown in fig. 5.
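A sketch of the augmentation ranges quoted above (rotation in [-40°, +40°], scale in [0.5, 1.2], random horizontal flip), using OpenCV; in a real pipeline the joint coordinates and label maps would have to be transformed with the same parameters.

```python
import numpy as np
import cv2

def augment(image, rng):
    """Random rotation, scaling, and horizontal flip with the ranges
    quoted in the text. A minimal image-only sketch; the corresponding
    belief maps and label maps need the same transformation."""
    h, w = image.shape[:2]
    angle = rng.uniform(-40.0, 40.0)
    scale = rng.uniform(0.5, 1.2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    out = cv2.warpAffine(image, M, (w, h))
    if rng.random() < 0.5:
        out = out[:, ::-1]               # horizontal flip
    return out

rng = np.random.default_rng(0)
img = np.zeros((368, 368, 3), dtype=np.uint8)  # hypothetical input size
aug = augment(img, rng)
```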
4.2. 3D Human Pose Reconstruction

For the evaluation of the 3d pose reconstruction task, we use HumanEva-I and Human3.6M (specifically, Human80K), as they are both challenging datasets, containing complementary poses and offering accurate 3d pose annotations. In all experiments, we use the same 3d evaluation metrics as the methods we compare against, according to standard practice. For the following experiments, note that our 2d semantic module components are trained on data from LSP and Human80K, whereas the 3d component of the module was pre-trained using Human80K.

We report results on the Human80K test set and investigate the impact of each of the input features J, B and D on the overall performance of the 3d reconstruction task (see table 4). Notice that models using only the 'D-pipeline' for the 3d reconstruction part correspond to convolutional networks that feed forward from the image features, but do not leverage the semantic segmentation of the human body parts.

We observe that our fully integrated system, DMHS_R(J,B,D), achieves the lowest error of 63.35 mm. Besides the interest in computing additional detailed semantic human representations, it can be seen that by feeding in the results from other tasks, the body joint belief maps J and the body labeling probability maps B, the error is reduced considerably, from 77.56 mm (for a model based on feed-forward processing that infers additional 2d joint position information) to 63.35 mm. Notice also our significant gains with respect to the previous state-of-the-art results on Human80K, as reported by [15].

| Model            | Avg. MPJPE (mm) |
|------------------|-----------------|
| [15]             | 92.00           |
| DMHS_R(J)        | 128.05          |
| DMHS_R(D)        | 77.56           |
| DMHS_R(J,B)      | 118.68          |
| DMHS_R(J,D)      | 72.00           |
| DMHS_R(J,B,D)    | 63.35           |

Table 4. 3d mean joint position error on the Human80K dataset. Different components of our model are compared to [15].

For HumanEva-I, we use the DMHS_R(J,B,D) architecture trained on Human80K and fine-tuned for a few epochs (6) on the HumanEva-I training data. We use a subset of the training set containing 4,637 samples. The fine-tuning step on HumanEva-I is performed in order to compensate for the differences in marker positioning with respect to Human80K, and to account for the different pose distributions w.r.t. Human80K. In table 3 we compare our results against several state-of-the-art methods. We follow the standard evaluation procedure of [39, 4, 48, 42] and sample data from the validation set for the walking, jogging and boxing activities. We use a single camera. We obtain considerable performance gains with respect to the previous state-of-the-art methods on HumanEva, even though we only use a small subset of the available training set. For this model, we use an initial learning rate of 10^-7, reduced every 5 epochs by a constant factor γ = 0.66. Qualitative results of our method can be seen in fig. 5 and fig. 6.

| Method         | Walking (S1/S2/S3)  | Avg.  | Jog (S1/S2/S3)      | Avg. | Box (S1/S2/S3)      | Avg. | All  |
|----------------|---------------------|-------|---------------------|------|---------------------|------|------|
| [39]           | 65.1 / 48.6 / 73.5  | 62.4  | 74.2 / 46.6 / 32.2  | 51.0 | - / - / -           | -    | -    |
| [4]            | 45.4 / 28.3 / 62.3  | 45.33 | 55.1 / 43.2 / 37.4  | 45.2 | 42.5 / 64.0 / 69.3  | 58.6 | 49.7 |
| [48]           | 35.8 / 32.4 / 41.6  | 36.6  | 46.6 / 41.4 / 35.4  | 41.1 | - / - / -           | -    | -    |
| [42]           | 37.5 / 25.1 / 49.2  | 37.3  | - / - / -           | -    | 50.5 / 61.7 / 57.5  | 56.6 | -    |
| DMHS_R(J,B,D)  | 27.1 / 18.4 / 39.5  | 28.3  | 37.6 / 28.9 / 27.6  | 31.4 | 30.5 / 45.8 / 48.0  | 41.5 | 33.7 |

Table 3. 3d mean joint position error (mm) on the HumanEva-I dataset, computed between our predicted joints and the ground truth after an alignment with a rigid transformation. Comparisons with other competing methods show that DMHS achieves state-of-the-art performance.
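Table 3 reports errors after aligning the prediction to the ground truth with a rigid transformation. A standard way to do this is the orthogonal Procrustes (Kabsch) solution, sketched below; this is our illustration of the protocol, without scale alignment, and not the authors' evaluation code.

```python
import numpy as np

def rigid_align(P, Q):
    """Align predicted joints P to ground truth Q (both (N, 3)) with a
    rotation and translation before computing the error. Classic Kabsch
    orthogonal Procrustes solution."""
    mu_P, mu_Q = P.mean(axis=0), Q.mean(axis=0)
    A, B = P - mu_P, Q - mu_Q                    # centered point sets
    U, _, Vt = np.linalg.svd(A.T @ B)
    d = np.sign(np.linalg.det(U @ Vt))           # avoid reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return (A @ R) + mu_Q                        # P mapped onto Q's frame

def aligned_mpjpe(P, Q):
    """Mean per-joint error after rigid alignment, as in table 3."""
    return np.mean(np.linalg.norm(rigid_align(P, Q) - Q, axis=1))
```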
5. Conclusions

We have proposed a deep multitask architecture for fully automatic 2d and 3d human sensing (DMHS), including recognition and reconstruction, based on monocular images. Our system estimates the figure-ground segmentation, detects the human body joints, semantically identifies the body parts, and reconstructs the 2d and 3d pose of the person. The design of recurrent multi-task loss functions at multiple stages of processing supports the principled combination of the strengths of different 2d and 3d datasets, without being limited by their different weaknesses. In experiments we perform ablation studies, evaluate the effect of various types of training data in the multitask loss, and demonstrate that state-of-the-art results can be achieved at all processing levels. We show that, even in the wild, our monocular RGB architecture is perceptually competitive to state-of-the-art commercial RGB-D systems.

Figure 5. Recognition and reconstruction results for images from Human3.6M and LSP. For each image we show the 2d pose estimate, the semantic segmentation of body parts, and the 3d pose estimate. Notice the difficulty of backgrounds and poses, and the fact that the 2d and 3d models generalize well. Notice also that in our architecture, errors during the early stages of processing (2d pose estimation) can be corrected later, e.g. during semantic body part segmentation or 3d pose estimation.

Figure 6. Qualitative comparisons for segmentation and reconstruction between our RGB model (top row) and a commercial RGB-D Kinect for Xbox One system (bottom row). Our model produces accurate figure-ground segmentations, body part labeling, and 3d reconstruction for some challenging poses.

6. Acknowledgment

This work was supported in part by CNCS-UEFISCDI under PCE-2011-3-0438, JRP-RO-FR-2014-16.

References

[1] A. Agarwal and B. Triggs. Recovering 3d human pose from monocular images. PAMI, 28(1):44-58, 2006.
[2] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In CVPR, 2015.
[3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[4] L. Bo and C. Sminchisescu. Twin gaussian processes for structured prediction. IJCV, 87(1-2):28-52, 2010.
[5] L. Bo, C. Sminchisescu, A. Kanaujia, and D. Metaxas. Fast algorithms for large scale conditional 3d prediction. In CVPR, 2008.
[6] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image. In ECCV, 2016.
[7] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In ECCV, 2010.
[8] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
[9] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, 2016.
[10] X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.
[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627-1645, 2010.
[12] A. Hernández-Vela, S. Sclaroff, and S. Escalera. Poselet-based contextual rescoring for human pose estimation via pictorial structures. IJCV, 118:49-64, 2016.
[13] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
[14] C. Ionescu, L. Bo, and C. Sminchisescu. Structural SVM for visual localization and continuous state estimation. In ICCV, 2009.
[15] C. Ionescu, J. Carreira, and C. Sminchisescu. Iterated second-order label sensitive pooling for 3d human pose estimation. In CVPR, 2014.
[16] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. PAMI, 2014.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[18] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, 2010.
[19] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, 2010.
[20] M. Kiefel and P. V. Gehler. Human pose estimation with fields of parts. In ECCV, 2014.
[21] S. Li and A. B. Chan. 3d human pose estimation from monocular images with deep convolutional neural network. In ACCV, 2014.
[22] S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3d human pose estimation. In ICCV, 2015.
[23] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan. Human parsing with contextualized convolutional neural network. In ICCV, 2015.
[24] I. Lifshitz, E. Fetaya, and S. Ullman. Human pose estimation using deep consensus voting. In ECCV, 2016.
[25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[26] G. Mori and J. Malik. Recovering 3d human body configurations using shape contexts. PAMI, 28(7):1052-1062, 2006.
[27] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[28] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In ICCV, 2015.
[29] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In CVPR, 2013.
[30] G. Pons-Moll, D. J. Fleet, and B. Rosenhahn. Posebits for monocular human pose estimation. In CVPR, 2014.
[31] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3d human pose from 2d image landmarks. In ECCV, 2012.
[32] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh. Pose machines: Articulated pose estimation via inference machines. In ECCV, 2014.
[33] H. Rhodin, N. Robertini, D. Casas, C. Richardt, H.-P. Seidel, and C. Theobalt. General automatic human shape and motion capture using volumetric contour cues. In ECCV, 2016.
[34] R. Rosales and S. Sclaroff. Learning body pose via specialized maps. In NIPS, 2001.
[35] B. Rosenhahn, R. Klette, and D. Metaxas, editors. Human Motion: Understanding, Modelling, Capture and Animation, volume 36. Springer Verlag, 2008.
[36] G. Shakhnarovich, P. A. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In ICCV, 2003.
[37] L. Sigal, A. Balan, and M. J. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, 2007.
[38] L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87(1-2):4-27, 2010.
[39] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2d and 3d pose estimation from a single image. In CVPR, 2013.
[40] C. Sminchisescu and B. Triggs. Kinematic jump processes for monocular 3d human tracking. In CVPR, 2003.
[41] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Multi-person tracking by multicut and deep matching. In ECCV, 2016.
[42] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3d body poses from motion compensated sequences. arXiv preprint arXiv:1402.0119, 2015.
[43] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
[44] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[45] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of 3d human poses from a single image. In CVPR, 2014.
[46] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[47] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. PAMI, 35(12):2878-2890, 2013.
[48] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3d pose estimation from a single image. In CVPR, 2016.
[49] X. Zhou, M. Zhu, S. Leonardos, K. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3d human pose estimation from monocular video. In CVPR, 2016.
