Convolutional Pose Machines

Shih-En Wei, Varun Ramakrishna, Takeo Kanade, Yaser Sheikh
The Robotics Institute, Carnegie Mellon University

arXiv:1602.00134v4 [cs.CV] 12 Apr 2016

Abstract

Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets.

Figure 1: A Convolutional Pose Machine consists of a sequence of predictors trained to make dense predictions at each image location. Here we show the increasingly refined estimates for the location of the right elbow in each stage of the sequence. (a) Predicting from local evidence often causes confusion. (b) Multi-part context helps resolve ambiguity. (c) Additional iterations help converge to a certain solution. [Panels: Input Image, (a) Stage 1, (b) Stage 2, (c) Stage 3.]

1. Introduction

We introduce Convolutional Pose Machines (CPMs) for the task of articulated pose estimation. CPMs inherit the benefits of the pose machine [29] architecture (the implicit learning of long-range dependencies between image and multi-part cues, tight integration between learning and inference, a modular sequential design) and combine them with the advantages afforded by convolutional architectures: the ability to learn feature representations for both image and spatial context directly from data; a differentiable architecture that allows for globally joint training with backpropagation; and the ability to efficiently handle large training datasets.

CPMs consist of a sequence of convolutional networks that repeatedly produce 2D belief maps¹ for the location of each part. At each stage in a CPM, image features and the belief maps produced by the previous stage are used as input. The belief maps provide the subsequent stage an expressive non-parametric encoding of the spatial uncertainty of location for each part, allowing the CPM to learn rich image-dependent spatial models of the relationships between parts. Instead of explicitly parsing such belief maps either using graphical models [28, 38, 39] or specialized post-processing steps [38, 40], we learn convolutional networks that directly operate on intermediate belief maps and learn implicit image-dependent spatial models of the relationships between parts. The overall proposed multi-stage architecture is fully differentiable and therefore can be trained in an end-to-end fashion using backpropagation.

At a particular stage in the CPM, the spatial context of part beliefs provides strong disambiguating cues to a subsequent stage. As a result, each stage of a CPM produces belief maps with increasingly refined estimates for the locations of each part (see Figure 1). In order to capture long-range interactions between parts, the design of the network in each stage of our sequential prediction framework is motivated by the goal of achieving a large receptive field on both the image and the belief maps. We find, through experiments, that large receptive fields on the belief maps are crucial for learning long-range spatial relationships and result in improved accuracy.

Composing multiple convolutional networks in a CPM results in an overall network with many layers that is at risk of the problem of vanishing gradients [4, 5, 10, 12] during learning. This problem can occur because back-propagated gradients diminish in strength as they are propagated through the many layers of the network. While there exists recent work² which shows that supervising very deep networks at intermediate layers aids in learning [20, 36], it has mostly been restricted to classification problems. In this work, we show how, for a structured prediction problem such as pose estimation, CPMs naturally suggest a systematic framework that replenishes gradients and guides the network to produce increasingly accurate belief maps by enforcing intermediate supervision periodically through the network. We also discuss different training schemes of such a sequential prediction architecture.

Our main contributions are (a) learning implicit spatial models via a sequential composition of convolutional architectures and (b) a systematic approach to designing and training such an architecture to learn both image features and image-dependent spatial models for structured prediction tasks, without the need for any graphical model style inference. We achieve state-of-the-art results on standard benchmarks including the MPII, LSP, and FLIC datasets, and analyze the effects of jointly training a multi-staged architecture with repeated intermediate supervision.

¹ We use the term belief in a slightly loose sense, however the belief maps described are closely related to beliefs produced in message passing inference in graphical models. The overall architecture can be viewed as an unrolled mean-field message passing inference algorithm [31] that is learned end-to-end using backpropagation.

2. Related Work

The classical approach to articulated pose estimation is the pictorial structures model [2, 3, 9, 14, 26, 27, 30, 43] in which spatial correlations between parts of the body are expressed as a tree-structured graphical model with kinematic priors that couple connected limbs. These methods have been successful on images where all the limbs of the person are visible, but are prone to characteristic errors such as double-counting image evidence, which occur because of correlations between variables that are not captured by a tree-structured model. The work of Kiefel et al. [17] is based on the pictorial structures model but differs in the underlying graph representation. Hierarchical models [35, 37] represent the relationships between parts at different scales and sizes in a hierarchical tree structure. The underlying assumption of these models is that larger parts (that correspond to full limbs instead of joints) can often have discriminative image structure that can be easier to detect and consequently help reason about the location of smaller, harder-to-detect parts. Non-tree models [8, 16, 19, 33, 42] incorporate interactions that introduce loops to augment the tree structure with additional edges that capture symmetry, occlusion and long-range relationships. These methods usually have to rely on approximate inference during both learning and at test time, and therefore have to trade off accurate modeling of spatial relationships with models that allow efficient inference, often with a simple parametric form to allow for fast inference. In contrast, methods based on a sequential prediction framework [29] learn an implicit spatial model with potentially complex interactions between variables by directly training an inference procedure, as in [22, 25, 31, 41].

There has been a recent surge of interest in models that employ convolutional architectures for the task of articulated pose estimation [6, 7, 23, 24, 28, 38, 39]. Toshev et al. [40] take the approach of directly regressing the Cartesian coordinates using a standard convolutional architecture [18]. Recent work regresses images to confidence maps and resorts to graphical models, which require hand-designed energy functions or heuristic initialization of spatial probability priors, to remove outliers on the regressed confidence maps. Some of these methods also utilize a dedicated network module for precision refinement [28, 38]. In this work, we show that the regressed confidence maps are suitable as input to further convolutional networks with large receptive fields to learn implicit spatial dependencies without the use of hand-designed priors, and achieve state-of-the-art performance across all precision regions without careful initialization and dedicated precision refinement. Pfister et al. [24] also used a network module with a large receptive field to capture implicit spatial models. Due to the differentiable nature of convolutions, our model can be globally trained, where Tompson et al. [39] and Stewart et al. [34] also discussed the benefit of joint training.

Carreira et al. [6] train a deep network that iteratively improves part detections using error feedback but use a Cartesian representation as in [40], which does not preserve spatial uncertainty and results in lower accuracy in the high-precision regime. In this work, we show how the sequential prediction framework takes advantage of the preserved uncertainty in the confidence maps to encode the rich spatial context, while enforcing intermediate local supervision to address the problem of vanishing gradients.

3. Method

3.1. Pose Machines

We denote the pixel location of the p-th anatomical landmark (which we refer to as a part) by $Y_p \in \mathcal{Z} \subset \mathbb{R}^2$, where $\mathcal{Z}$ is the set of all $(u, v)$ locations in an image. Our goal is to predict the image locations $Y = (Y_1, \ldots, Y_P)$ for all $P$ parts. A pose machine [29] (see Figure 2a and 2b) consists of a sequence of multi-class predictors, $g_t(\cdot)$, that are trained to predict the location of each part in each level of the hierarchy.
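The sequence-of-predictors structure just described can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: `run_pose_machine`, `toy_first_stage` and `toy_later_stage` are hypothetical names, and the toy predictors are simple stand-ins for the learned predictors $g_t$.

```python
def run_pose_machine(image_features, first_stage, later_stages):
    """Run the predictors in order and keep the beliefs of every stage."""
    # Stage 1 sees image evidence only.
    beliefs = [first_stage(image_features)]
    # Stages 2..T see image evidence plus the previous stage's beliefs.
    for predictor in later_stages:
        beliefs.append(predictor(image_features, beliefs[-1]))
    return beliefs


# Toy stand-ins for the learned predictors (hypothetical, illustration only).
def toy_first_stage(features):
    return [row[:] for row in features]


def toy_later_stage(features, previous):
    # Average the image evidence with the previous stage's beliefs.
    return [[0.5 * (f + b) for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(features, previous)]


features = [[0.0, 1.0], [0.5, 0.25]]
all_beliefs = run_pose_machine(
    features, toy_first_stage, [toy_later_stage, toy_later_stage])
```

Keeping the beliefs of every stage, rather than only the last, is what later allows a loss to be attached to each stage.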
In each stage $t \in \{1 \ldots T\}$, the classifiers $g_t$ predict beliefs for assigning a location to each part $Y_p = z, \forall z \in \mathcal{Z}$, based on features extracted from the image at the location $z$ denoted by $x_z \in \mathbb{R}^d$ and contextual information from the preceding classifier in the neighborhood around each $Y_p$ in stage $t$. A classifier in the first stage $t = 1$, therefore produces the following belief values:

$$g_1(x_z) \rightarrow \{b_1^p(Y_p = z)\}_{p \in \{0 \ldots P\}}, \tag{1}$$

where $b_1^p(Y_p = z)$ is the score predicted by the classifier $g_1$ for assigning the p-th part in the first stage at image location $z$. We represent all the beliefs of part $p$ evaluated at every location $z = (u, v)^T$ in the image as $b_t^p \in \mathbb{R}^{w \times h}$, where $w$ and $h$ are the width and height of the image, respectively. That is,

$$b_t^p[u, v] = b_t^p(Y_p = z). \tag{2}$$

For convenience, we denote the collection of belief maps for all the parts as $b_t \in \mathbb{R}^{w \times h \times (P+1)}$ ($P$ parts plus one for background).

In subsequent stages, the classifier predicts a belief for assigning a location to each part $Y_p = z, \forall z \in \mathcal{Z}$, based on (1) features of the image data $x_z^t \in \mathbb{R}^d$ again, and (2) contextual information from the preceding classifier in the neighborhood around each $Y_p$:

$$g_t\left(x'_z, \psi_t(z, b_{t-1})\right) \rightarrow \{b_t^p(Y_p = z)\}_{p \in \{0 \ldots P+1\}}, \tag{3}$$

where $\psi_{t>1}(\cdot)$ is a mapping from the beliefs $b_{t-1}$ to context features. In each stage, the computed beliefs provide an increasingly refined estimate for the location of each part. Note that we allow the image features $x'_z$ for subsequent stages to be different from the image features used in the first stage, $x$. The pose machine proposed in [29] used boosted random forests for prediction ($\{g_t\}$), fixed hand-crafted image features across all stages ($x' = x$), and fixed hand-crafted context feature maps ($\psi_t(\cdot)$) to capture spatial context across all stages.

² New results have shown that using skip connections with identity mappings [11] in so-called residual units also aids in addressing vanishing gradients in "very deep" networks. We view this method as complementary and it can be noted that our modular architecture easily allows us to replace each stage with the appropriate residual network equivalent.

Figure 2: Architecture and receptive fields of CPMs. We show a convolutional architecture and receptive fields across layers for a CPM with any T stages. The pose machine [29] is shown in insets (a) and (b), and the corresponding convolutional networks are shown in insets (c) and (d). Insets (a) and (c) show the architecture that operates only on image evidence in the first stage. Insets (b) and (d) show the architecture for subsequent stages, which operate both on image evidence as well as belief maps from preceding stages. The architectures in (b) and (d) are repeated for all subsequent stages (2 to T). The network is locally supervised after each stage using an intermediate loss layer that prevents vanishing gradients during training. Below in inset (e) we show the effective receptive field on an image (centered at left knee) of the architecture, where the large receptive field enables the model to capture long-range spatial dependencies such as those between head and knees. (Best viewed in color.) [Diagram labels: stage 1 is a chain of 9×9 convolution, 2× pooling, 9×9 convolution, 2× pooling, 9×9 convolution, 2× pooling, 5×5 convolution, 9×9 convolution and two 1×1 convolutions; later stages apply three 11×11 convolutions and two 1×1 convolutions. Effective receptive fields in inset (e): 9×9, 26×26, 60×60, 96×96, 160×160, 240×240, 320×320 and 400×400 pixels.]

3.2. Convolutional Pose Machines

We show how the prediction and image feature computation modules of a pose machine can be replaced by a deep convolutional architecture allowing for both image and contextual feature representations to be learned directly from data. Convolutional architectures also have the advantage of being completely differentiable, thereby enabling end-to-end joint training of all stages of a CPM. We describe our design for a CPM that combines the advantages of deep convolutional architectures with the implicit spatial modeling afforded by the pose machine framework.

3.2.1 Keypoint Localization Using Local Image Evidence

The first stage of a convolutional pose machine predicts part beliefs from only local image evidence. Figure 2c shows the network structure used for part detection from local image evidence using a deep convolutional network. The evidence is local because the receptive field of the first stage of the network is constrained to a small patch around the output pixel location. We use a network structure composed of five convolutional layers followed by two 1 × 1 convolutional layers which results in a fully convolutional architecture [21]. In practice, to achieve certain precision, we normalize input cropped images to size 368 × 368 (see Section 4.2 for details), and the receptive field of the network shown above is 160 × 160 pixels. The network can effectively be viewed as sliding a deep network across an image and regressing from the local image evidence in each 160 × 160 image patch to a $P + 1$ sized output vector that represents a score for each part at that image location.

Figure 3: Spatial context from belief maps of easier-to-detect parts can provide strong cues for localizing difficult-to-detect parts. The spatial contexts from shoulder, neck and head can help eliminate wrong (red) and strengthen correct (green) estimations on the belief map of right elbow in the subsequent stages. [Panels: stage 1 belief maps for R. Elbow, R. Shoulder, Neck and Head; stage 2 and stage 3 belief maps for R. Elbow.]

Figure 4: Large receptive fields for spatial context. We show that networks with large receptive fields are effective at modeling long-range spatial interactions between parts. Note that these experiments are operated with smaller normalized images than our best setting. [Plots: "FLIC Wrists: Effect of Receptive Field" and "FLIC Elbows: Effect of Receptive Field"; accuracy versus effective receptive field in pixels, for left and right joints.]

3.2.2 Sequential Prediction with Learned Spatial Context Features

While the detection rate on landmarks with consistent appearance, such as the head and shoulders, can be favorable, the accuracies are often much lower for landmarks lower down the kinematic chain of the human skeleton due to their large variance in configuration and appearance. The landscape of the belief maps around a part location, albeit noisy, can, however, be very informative. As illustrated in Figure 3, when detecting challenging parts such as the right elbow, the belief map for the right shoulder with a sharp peak can be used as a strong cue. A predictor in subsequent stages ($g_{t>1}$) can use the spatial context ($\psi_{t>1}(\cdot)$) of the noisy belief maps in a region around the image location $z$ and improve its predictions by leveraging the fact that parts occur in consistent geometric configurations. In the second stage of a pose machine, the classifier $g_2$ accepts as input the image features $x_z^2$ and features computed on the beliefs via the feature function $\psi$ for each of the parts in the previous stage. The feature function $\psi$ serves to encode the landscape of the belief maps from the previous stage in a spatial region around the location $z$ of the different parts. For a convolutional pose machine, we do not have an explicit function that computes context features. Instead, we define $\psi$ as being the receptive field of the predictor on the beliefs from the previous stage.

The design of the network is guided by achieving a receptive field at the output layer of the second stage network that is large enough to allow the learning of potentially complex and long-range correlations between parts. By simply supplying features on the outputs of the previous stage (as opposed to specifying potential functions in a graphical model), the convolutional layers in the subsequent stage allow the classifier to freely combine contextual information by picking the most predictive features. The belief maps from the first stage are generated from a network that examined the image locally with a small receptive field. In the second stage, we design a network that drastically increases the equivalent receptive field. Large receptive fields can be achieved either by pooling at the expense of precision, increasing the kernel size of the convolutional filters at the expense of increasing the number of parameters, or by increasing the number of convolutional layers at the risk of encountering vanishing gradients during training. Our network design and corresponding receptive field for the subsequent stages ($t \geq 2$) is shown in Figure 2d. We choose to use multiple convolutional layers to achieve a large receptive field on the 8× downscaled heatmaps, as it allows us to be parsimonious with respect to the number of parameters of the model. We found that our stride-8 network performs as well as a stride-4 one even in the high precision region, while making it easier to achieve larger receptive fields. We also repeat a similar structure for image feature maps to make the spatial context image-dependent and allow error correction, following the structure of the pose machine.

We find that accuracy improves with the size of the receptive field. In Figure 4 we show the improvement in accuracy on the FLIC dataset [32] as the size of the receptive field on the original image is varied by varying the architecture without significantly changing the number of parameters, through a series of experimental trials on input images normalized to a size of 304 × 304. We see that the accuracy improves as the effective receptive field increases, and starts to saturate around 250 pixels, which also happens to be roughly the size of the normalized object. This improvement in accuracy with receptive field size suggests that the network does indeed encode long-range interactions between parts and that doing so is beneficial. In our best performing setting in Figure 2, we normalize cropped images into a larger size of 368 × 368 pixels for better precision, and the receptive field of the second stage output on the belief maps of the first stage is set to 31 × 31, which is equivalently 400 × 400 pixels on the original image, where the radius can usually cover any pair of the parts. With more stages, the effective receptive field is even larger.
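The receptive field sizes quoted above (160 × 160 for stage 1, 31 × 31 on the belief maps and 400 × 400 on the image for stage 2) can be checked with the usual layer-by-layer recurrence. The sketch below is an illustration under stated assumptions, not the released implementation: the layer lists are read off the labels of Figure 2, each "2×" pooling layer is assumed to have kernel 2 and stride 2, and `receptive_fields`, `STAGE1` and `STAGE2` are hypothetical names.

```python
def receptive_fields(layers, rf=1, jump=1):
    """Receptive field (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs.
    rf, jump: receptive field and pixel spacing of the input to the first layer.
    """
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each extra tap widens the field by one jump
        jump *= stride              # striding spreads later taps further apart
        sizes.append(rf)
    return sizes, jump


# Stage 1: 9x9 conv, pool, 9x9 conv, pool, 9x9 conv, pool, 5x5 conv, 9x9 conv,
# then two 1x1 convs. Pooling kernel 2 / stride 2 is an assumption.
STAGE1 = [(9, 1), (2, 2), (9, 1), (2, 2), (9, 1), (2, 2),
          (5, 1), (9, 1), (1, 1), (1, 1)]
# Later stage, applied to the belief maps: three 11x11 convs, two 1x1 convs.
STAGE2 = [(11, 1), (11, 1), (11, 1), (1, 1), (1, 1)]

stage1_rf, stage1_stride = receptive_fields(STAGE1)
rf_on_beliefs, _ = receptive_fields(STAGE2)
rf_on_image, _ = receptive_fields(STAGE2, rf=stage1_rf[-1], jump=stage1_stride)

print(stage1_rf[-1], rf_on_beliefs[-1], rf_on_image[-1])
```

Under these assumptions the recurrence passes through the sizes labelled in Figure 2 (9, 26, 60, 96, 160 for stage 1, then 240, 320, 400), and the total stride of stage 1 comes out as 8, matching the 8× downscaled heatmaps mentioned in the text.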
In the following section we show our results from up to 6 stages.

3.3. Learning in Convolutional Pose Machines

The design described above for a pose machine results in a deep architecture that can have a large number of layers. Training such a network with many layers can be prone to the problem of vanishing gradients [4, 5, 10] where, as observed by Bradley [5] and Bengio et al. [10], the magnitude of back-propagated gradients decreases in strength with the number of intermediate layers between the output layer and the input layer.

Fortunately, the sequential prediction framework of the pose machine provides a natural approach to training our deep architecture that addresses this problem. Each stage of the pose machine is trained to repeatedly produce the belief maps for the locations of each of the parts. We encourage the network to repeatedly arrive at such a representation by defining a loss function at the output of each stage $t$ that minimizes the $l_2$ distance between the predicted and ideal belief maps for each part. The ideal belief map for a part $p$ is written as $b_*^p(Y_p = z)$, and is created by putting Gaussian peaks at the ground truth locations of each body part $p$. The cost function we aim to minimize at the output of each stage at each level is therefore given by:

$$f_t = \sum_{p=1}^{P+1} \sum_{z \in \mathcal{Z}} \left\| b_t^p(z) - b_*^p(z) \right\|_2^2. \tag{4}$$

The overall objective for the full architecture is obtained by adding the losses at each stage and is given by:

$$F = \sum_{t=1}^{T} f_t. \tag{5}$$

We use standard stochastic gradient descent to jointly train all the $T$ stages in the network. To share the image feature $x'$ across all subsequent stages, we share the weights of corresponding convolutional layers (see Figure 2) across stages $t \geq 2$.

Figure 5: Intermediate supervision addresses vanishing gradients. We track the change in magnitude of gradients in layers at different depths in the network, across training epochs, for models with and without intermediate supervision. We observe that for layers closer to the output, the distribution has a large variance for both with and without intermediate supervision; however as we move from the output layer towards the input, the gradient magnitude distribution peaks tightly around zero with low variance (the gradients vanish) for the model without intermediate supervision. For the model with intermediate supervision the distribution has a moderately large variance throughout the network. At later training epochs, the variances decrease for all layers for the model with intermediate supervision and remain tightly peaked around zero for the model without intermediate supervision. (Best viewed in color.) [Plot: "Histograms of Gradient Magnitude During Training"; columns for layers 1, 3, 6, 7, 9, 12, 13, 15 and 18 from input to output, with supervision after stages 1, 2 and 3; rows for epochs 1 to 3; horizontal axis is gradient (× 10⁻³).]

4. Evaluation

4.1. Analysis

Addressing vanishing gradients. The objective in Equation 5 describes a decomposable loss function that operates on different parts of the network (see Figure 2). Specifically, each term in the summation is applied to the network after each stage $t$, effectively enforcing supervision in intermediate stages through the network. Intermediate supervision has the advantage that, even though the full architecture can have many layers, it does not fall prey to the vanishing gradient problem as the intermediate loss functions replenish the gradients at each stage.

We verify this claim by observing histograms of gradient magnitude (see Figure 5) at different depths in the architecture across training epochs for models with and without intermediate supervision. In early epochs, as we move from the output layer to the input layer, we observe that for the model without intermediate supervision the gradient distribution is tightly peaked around zero because of vanishing gradients.
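The decomposable objective of Equations 4 and 5 can be written out directly. The following is a minimal pure-Python sketch rather than the authors' Caffe implementation: the function names are hypothetical, belief maps are plain nested lists indexed as `map[v][u]`, and the Gaussian width `sigma` is an assumed parameter that the text does not specify.

```python
import math


def ideal_belief_map(height, width, center, sigma=1.0):
    """Gaussian peak at the ground-truth location center = (u, v)."""
    cu, cv = center
    return [[math.exp(-((u - cu) ** 2 + (v - cv) ** 2) / (2.0 * sigma ** 2))
             for u in range(width)]
            for v in range(height)]


def stage_loss(predicted, ideal):
    """Equation 4: squared l2 distance summed over all maps and locations."""
    return sum((b - t) ** 2
               for pred_map, ideal_map in zip(predicted, ideal)
               for pred_row, ideal_row in zip(pred_map, ideal_map)
               for b, t in zip(pred_row, ideal_row))


def total_loss(stage_predictions, ideal):
    """Equation 5: sum of the per-stage losses over all stages."""
    return sum(stage_loss(predicted, ideal) for predicted in stage_predictions)


# One part on a 5 x 7 map; stage 1 is perfect, stage 2 is off by 1 everywhere.
target = [ideal_belief_map(5, 7, center=(3, 2))]
shifted = [[[t + 1.0 for t in row] for row in m] for m in target]
objective = total_loss([target, shifted], target)
```

Because every stage contributes its own term, each stage receives a gradient signal from a loss attached directly to its output, which is the mechanism the analysis above relies on.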
The model with intermediate supervision has a much larger variance across all layers, suggesting that learning is indeed occurring in all the layers thanks to intermediate supervision. We also notice that as training progresses, the variance in the gradient magnitude distributions decreases, pointing to model convergence.

Figure 6: Comparisons on 3-stage architectures on the LSP dataset (PC): (a) Improvements over Pose Machine. (b) Comparisons between the different training methods. (c) Comparisons across each number of stages using joint training from scratch with intermediate supervision. [Plots: "PCK total, LSP PC"; detection rate (%) versus normalized distance from 0 to 0.2. Legend entries: Ours 6-Stage and Ramakrishna et al., ECCV'14; (i) Ours 3-Stage, (ii) Ours 3-Stage stagewise (sw), (iii) Ours 3-Stage sw + finetune, (iv) Ours 3-Stage no IS; Ours 1-Stage to Ours 6-Stage.]

Benefit of end-to-end learning. We see in Figure 6a that replacing the modules of a pose machine with the appropriately designed convolutional architecture provides a large boost of 42.4 percentage points over the previous approach of [29] in the high precision regime (PCK@0.1) and 30.9 percentage points in the low precision regime (PCK@0.2).

Comparison on training schemes. We compare different variants of training the network in Figure 6b on the LSP dataset with person-centric (PC) annotations. To demonstrate the benefit of intermediate supervision with joint training across stages, we train the model in four ways: (i) training from scratch using a global loss function that enforces intermediate supervision, (ii) stage-wise, where each stage is trained in a feed-forward fashion and stacked, (iii) the same as (i) but initialized with weights from (ii), and (iv) the same as (i) but with no intermediate supervision. We find that network (i) outperforms all other training methods, showing that intermediate supervision and joint training across stages is indeed crucial in achieving good performance. The stagewise training in (ii) saturates at a sub-optimal level, and the joint fine-tuning in (iii) improves from this sub-optimal level to an accuracy close to (i), however with effectively longer training iterations.

Performance across stages. We show a comparison of performance across each stage on the LSP dataset (PC) in Figure 6c. We show that the performance increases monotonically until 5 stages, as the predictors in subsequent stages make use of contextual information in a large receptive field on the previous stage belief maps to resolve confusions between parts and background. We see diminishing returns at the 6th stage, which is the number we choose for reporting our best results in this paper for LSP and MPII datasets.

4.2. Datasets and Quantitative Analysis

In this section we present our numerical results in various standard benchmarks including the MPII, LSP, and FLIC datasets. To have normalized input samples of 368 × 368 for training, we first resize the images to roughly make the samples into the same scale, and then crop or pad the image according to the center positions and rough scale estimations provided in the datasets if available. In datasets such as LSP without this information, we estimate them according to joint positions or image sizes. For testing, we perform similar resizing and cropping (or padding), but estimate center position and scale only from image sizes when necessary. In addition, we merge the belief maps from different scales (perturbed around the given one) for final predictions, to handle the inaccuracy of the given scale estimation.

We define and implement our model using the Caffe [13] libraries for deep learning. We publicly release the source code and details on the architecture, learning parameters, design decisions and data augmentation to ensure full reproducibility.³

³ https://github.com/CMU-Perceptual-Computing-Lab/convolutional-pose-machines-release

MPII Human Pose Dataset. We show in Figure 8 our results on the MPII Human Pose dataset [1] which consists of more than 28000 training samples. We choose to randomly augment the data with rotation degrees in [−40°, 40°], scaling with factors in [0.7, 1.3], and horizontal flipping. The evaluation is based on the PCKh metric [1] where the error tolerance is normalized with respect to the head size of the target. Because there often are multiple people in the proximity of the person of interest (a rough center position is given in the dataset), we made two sets of ideal belief maps for training: one includes all the peaks for every person appearing in the proximity of the primary subject and the second type where we only place peaks for the primary subject. We supply the first set of belief maps to the loss layers in the first stage as the initial stage only relies on local image evidence to make predictions. We supply the second type of belief maps to the loss layers of all subsequent stages.

Figure 7: Comparison of belief maps across stages for the elbow and wrist joints on the LSP dataset for a 3-stage CPM. [Panels (a) and (b): left and right wrists and elbows at t = 1, 2, 3.]

Figure 8: Quantitative results on the MPII dataset using the PCKh metric. We achieve state of the art performance and outperform significantly on difficult parts such as the ankle. [Plots: PCKh total, hip, wrist & elbow, knee and ankle on MPII; detection rate (%) versus normalized distance from 0 to 0.5. Legend: Ours 6-stage + LEEDS, Ours 6-stage, Pishchulin CVPR'16, Tompson CVPR'15, Tompson NIPS'14, Carreira CVPR'16.]
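The PCKh metric referenced above normalizes the error tolerance by the head size of the target. A minimal sketch of that idea follows; it is not the official MPII evaluation code, which additionally handles details such as joints without annotations, and the function name `pckh` and its argument layout are assumptions made for illustration.

```python
import math


def pckh(predicted, ground_truth, head_sizes, alpha=0.5):
    """Fraction of keypoints within alpha * head size of the ground truth.

    predicted, ground_truth: one list of (u, v) keypoints per person.
    head_sizes: one head size (in pixels) per person.
    """
    correct = 0
    total = 0
    for pred_pose, true_pose, head in zip(predicted, ground_truth, head_sizes):
        for (pu, pv), (tu, tv) in zip(pred_pose, true_pose):
            distance = math.hypot(pu - tu, pv - tv)
            if distance <= alpha * head:
                correct += 1
            total += 1
    return correct / total


# One person, two keypoints, head size 10 pixels: errors of 5 and 1 pixels.
score_loose = pckh([[(3, 4), (0, 1)]], [[(0, 0), (0, 0)]], [10], alpha=0.5)
score_tight = pckh([[(3, 4), (0, 1)]], [[(0, 0), (0, 0)]], [10], alpha=0.2)
```

Sweeping `alpha` from 0 to 0.5 and plotting the score gives curves of the kind shown in Figure 8, with PCKh-0.5 as the commonly reported single number.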
Figure 9: Quantitative results on the LSP dataset using the PCK metric. Our method again achieves state of the art performance and has a significant advantage on challenging parts. [Plots: PCK total, hip, wrist & elbow, knee and ankle on LSP (PC); detection rate (%) versus normalized distance from 0 to 0.2. Legend: Ours 6-Stage + MPI, Ours 6-Stage, Pishchulin CVPR'16 (relabel) + MPI, Tompson NIPS'14, Chen NIPS'14, Wang CVPR'13.]

We also find that supplying to all subsequent stages an additional heat-map with a Gaussian peak indicating the center of the primary subject is beneficial.

Our total PCKh-0.5 score achieves state of the art at 87.95% (88.52% when adding LSP training data), which is 6.11% higher than the closest competitor, and it is noteworthy that on the ankle (the most challenging part), our PCKh-0.5 score is 78.28% (79.41% when adding LSP training data), which is 10.76% higher than the closest competitor. This result shows the capability of our model to capture long distance context given that ankles are the farthest parts from the head and other more recognizable parts. Figure 11 shows that our accuracy is also consistently significantly higher than other methods across various view angles defined in [1], especially in the challenging non-frontal views. In summary, our method improves the accuracy in all parts, over all precisions, across all view angles, and is the first one achieving such high accuracy without any pre-training from other data, or post-inference parsing with hand-designed priors or initialization of such a structured prediction task as in [28, 39]. Our method also does not need another module dedicated to location refinement as in [38] to achieve great high-precision accuracy with a stride-8 network.

Leeds Sports Pose (LSP) Dataset. We evaluate our method on the Extended Leeds Sports Dataset [15] that consists of 11000 images for training and 1000 images for testing. We trained on person-centric (PC) annotations and evaluate our method using the Percentage Correct Keypoints (PCK) metric [44]. Using the same augmentation scheme as for the MPII dataset, our model again achieves state of the art at 84.32% (90.5% when adding MPII training data). Note that adding MPII data here significantly boosts our performance, due to its labeling quality being much better than LSP. Because of the noisy labels in the LSP dataset, Pishchulin et al. [28] reproduced the dataset with original high resolution images and better labeling quality.

FLIC Dataset. We evaluate our method on the FLIC Dataset [32] which consists of 3987 images for training and 1016 images for testing. We report accuracy as per the metric introduced in Sapp et al. [32] for the elbow and wrist joints in Figure 12. Again, we outperform all prior art at PCK@0.2 with 97.59% on elbows and 95.03% on wrists. In the higher precision region our advantage is even more significant: 14.8 percentage points on wrists and 12.7 percentage points on elbows at PCK@0.05, and 8.9 percentage points on wrists and 9.3 percentage points on elbows at PCK@0.1.

Figure 10: Qualitative results of our method on the MPII, LSP and FLIC datasets respectively. We see that the method is able to handle non-standard poses and resolve ambiguities between symmetric parts for a variety of different relative camera views.

Figure 11: Comparing PCKh-0.5 across various viewpoints in the MPII dataset. Our method is significantly better in all the viewpoints. [Plot: "PCKh by Viewpoint"; PCKh 0.5 (%) versus viewpoint clusters 1 to 15. Legend: Ours, Pishchulin et al., CVPR'16, Tompson et al., CVPR'15, Carreira et al., CVPR'16, Tompson et al., NIPS'14.]

Figure 12: Quantitative results on the FLIC dataset for the elbow and wrist joints with a 4-stage CPM. We outperform all competing methods. [Plots: PCK wrist and PCK elbow on FLIC; detection rate (%) versus normalized distance from 0 to 0.2. Legend: Ours 4-Stage, Tompson et al., NIPS'14, Toshev et al., CVPR'14, Tompson et al., CVPR'15, Chen et al., NIPS'14, Sapp et al., CVPR'13.]

5. Discussion

Convolutional pose machines provide an end-to-end architecture for tackling structured prediction problems in computer vision without the need for graphical-model style inference. We showed that a sequential architecture composed of convolutional networks is capable of implicitly learning spatial models for pose by communicating increasingly refined uncertainty-preserving beliefs between stages. Problems with spatial dependencies between variables arise in multiple domains of computer vision such as semantic image labeling, single image depth prediction and object detection, and future work will involve extending our architecture to these problems. Our approach achieves state of the art accuracy on all primary benchmarks, however we do observe failure cases mainly when multiple people are in close proximity. Handling multiple people in a single end-to-end architecture is also a challenging problem and an interesting avenue for future work.

References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[2] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
[3] M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In CVPR, 2010.
[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994.
[5] D. Bradley. Learning In Modular Systems. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 2010.
[6] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. arXiv preprint arXiv:1507.06550, 2015.
[7] X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.
[8] M. Dantone, J. Gall, C. Leistner, and L. Van Gool. Human pose estimation using body parts dependent joint regressors. In CVPR, 2013.
[9] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. In IJCV, 2005.
[10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-
[24] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In ICCV, 2015.
[25] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[26] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In CVPR, 2013.
[27] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Strong appearance and expressive spatial models for human pose estimation. In ICCV, 2013.
[28] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. arXiv preprint arXiv:1511.06645, 2015.
[29] V. Ramakrishna, D. Munoz, M. Hebert, J. Bagnell, and Y. Sheikh. Pose Machines: Articulated Pose Estimation via Inference Machines. In ECCV, 2014.
[30] D. Ramanan, D. A. Forsyth, and A. Zisserman. Strike a Pose: Tracking people by finding stylized poses. In CVPR, 2005.
[31] S. Ross, D. Munoz, M. Hebert, and J. Bagnell. Learning message-passing inference machines for structured prediction. In CVPR, 2011.
[32] B. Sapp and B. Taskar. MODEC: Multimodal Decomposable Models for Human Pose Estimation. In CVPR, 2013.
[33] L. Sigal and M. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In CVPR, 2006.
[34] R. Stewart and M. Andriluka. End-to-end people detection in crowded scenes. arXiv preprint arXiv:1506.04878, 2015.
[35] M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In ICCV, 2011.
[36] C.
Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, ingforimagerecognition.arXivpreprintarXiv:1512.03385, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabi- 2015. novich. Going deeper with convolutions. arXiv preprint [12] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. arXiv:1409.4842,2014. Gradient flow in recurrent nets: the difficulty of learning [37] Y.Tian,C.L.Zitnick,andS.G.Narasimhan. Exploringthe long-term dependencies. A Field Guide to Dynamical Re- spatialhierarchyofmixturemodelsforhumanposeestima- currentNeuralNetworks,IEEEPress,2001. tion. InECCV.2012. [13] Y.Jia,E.Shelhamer,J.Donahue,S.Karayev,J.Long,R.Gir- [38] J.Tompson,R.Goroshin,A.Jain,Y.LeCun,andC.Bregler. shick, S. Guadarrama, and T. Darrell. Caffe: Convolu- Efficientobjectlocalizationusingconvolutionalnetworks.In tionalarchitectureforfastfeatureembedding.arXivpreprint CVPR,2015. arXiv:1408.5093,2014. [39] J.Tompson,A.Jain,Y.LeCun,andC.Bregler.Jointtraining [14] S.JohnsonandM.Everingham. Clusteredposeandnonlin- ofaconvolutionalnetworkandagraphicalmodelforhuman earappearancemodelsforhumanposeestimation.InBMVC, poseestimation. InNIPS,2014. 2010. [40] A.ToshevandC.Szegedy. DeepPose: Humanposeestima- [15] S.JohnsonandM.Everingham. Learningeffectivehuman tionviadeepneuralnetworks. InCVPR,2013. poseestimationfrominaccurateannotation.InCVPR,2011. [41] Z.TuandX.Bai. Auto-contextanditsapplicationtohigh- [16] L.KarlinskyandS.Ullman. Usinglinkingfeaturesinlearn- level vision tasks and 3d brain image segmentation. In ingnon-parametricpartmodels. InECCV,2012. TPAMI,2010. [17] M. Kiefel and P. V. Gehler. Human pose estimation with [42] Y. Wang and G. Mori. Multiple tree models for occlusion fieldsofparts. InECCV.2014. andspatialconstraintsinhumanposeestimation. InECCV, [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet 2008. classification with deep convolutional neural networks. In [43] Y.YangandD.Ramanan. Articulatedposeestimationwith NIPS,2012. flexiblemixtures-of-parts. InCVPR,2011. 
[19] X.LanandD.Huttenlocher. Beyondtrees: Common-factor [44] Y.YangandD.Ramanan. Articulatedhumandetectionwith modelsfor2Dhumanposerecovery. InICCV,2005. flexiblemixturesofparts. InTPAMI,2013. [20] C.-Y.Lee,S.Xie,P.Gallagher,Z.Zhang,andZ.Tu.Deeply- supervisednets. InAISTATS,2015. [21] J.Long, E.Shelhamer, andT.Darrell. Fullyconvolutional networksforsemanticsegmentation. InCVPR,2015. [22] D.Munoz,J.Bagnell,andM.Hebert. Stackedhierarchical labeling. InECCV,2010. [23] W.Ouyang,X.Chu,andX.Wang. Multi-sourcedeeplearn- ingforhumanposeestimation. InCVPR,2014.
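Note on the evaluation metrics. The PCK and PCKh scores quoted in the evaluation count a predicted keypoint as correct when it falls within a fraction of a reference length from the ground-truth location. The sketch below is a minimal illustration of that thresholding rule, not the authors' evaluation code. The function names `pck` and `pckh`, the per-person list-of-tuples input format, and the choice to pass the reference length in as a parameter are all assumptions made for illustration; benchmark toolkits differ in how they define the reference length and how they aggregate over images.

```python
import math


def pck(pred, gt, ref_length, alpha=0.2):
    """Fraction of keypoints predicted within alpha * ref_length of ground truth.

    pred, gt   : lists of (x, y) coordinates for one person, in the same order.
    ref_length : normalizing length (assumption: e.g. torso or bounding-box
                 size for PCK, head segment length for PCKh).
    alpha      : normalized distance threshold (the x-axis of the PCK curves).
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    threshold = alpha * ref_length
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return hits / len(gt)


def pckh(pred, gt, head_length, alpha=0.5):
    """PCKh: the same rule, normalized by head segment length (PCKh-0.5 by default)."""
    return pck(pred, gt, head_length, alpha)


# Hypothetical toy data: three keypoints, head segment length of 20 pixels.
gt = [(10.0, 10.0), (50.0, 50.0), (90.0, 20.0)]
pred = [(12.0, 11.0), (50.0, 70.0), (91.0, 20.0)]
score = pckh(pred, gt, head_length=20.0)  # threshold is 10 pixels
```

In this toy example the second keypoint is 20 pixels from its ground truth and is counted as a miss, so two of three keypoints are correct. Sweeping `alpha` from 0 to 0.2 and plotting the resulting scores gives a curve of the kind shown in Figures 9 and 12.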
