Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image

Denis Tome (University College London), Chris Russell (The Turing Institute), Lourdes Agapito (University College London)
[email protected]  [email protected]  [email protected]
http://visual.cs.ucl.ac.uk/pubs/liftingFromTheDeep

arXiv:1701.00295v3 [cs.CV] 25 Feb 2017

Abstract

We propose a unified formulation for the problem of 3D human pose estimation from a single raw RGB image that reasons jointly about 2D joint estimation and 3D pose reconstruction to improve both tasks. We take an integrated approach that fuses probabilistic knowledge of 3D human pose with a multi-stage CNN architecture and uses the knowledge of plausible 3D landmark locations to refine the search for better 2D locations. The entire process is trained end-to-end, is extremely efficient and obtains state-of-the-art results on Human3.6M, outperforming previous approaches both on 2D and 3D errors.

1. Introduction

Estimating the full 3D pose of a human from a single RGB image is one of the most challenging problems in computer vision. It involves tackling two inherently ambiguous tasks. First, the 2D location of the human joints, or landmarks, must be found in the image, a problem plagued with ambiguities due to the large variations in visual appearance caused by different camera viewpoints, external and self occlusions, or changes in clothing, body shape or illumination. Next, lifting the coordinates of the 2D landmarks into 3D from a single image is still an ill-posed problem: the space of possible 3D poses consistent with the 2D landmark locations of a human is infinite. Finding the correct 3D pose that matches the image requires injecting additional information, usually in the form of 3D geometric pose priors and temporal or structural constraints.

We propose a new joint approach to 2D landmark detection and full 3D pose estimation from a single RGB image that takes advantage of reasoning jointly about the estimation of 2D and 3D landmark locations to improve both tasks. We propose a novel CNN architecture that learns to combine the image-appearance-based predictions provided by convolutional-pose-machine-style 2D landmark detectors [43] with the geometric 3D skeletal information encoded in our novel pre-trained model of 3D human pose. Our manifold is learned exclusively from 3D mocap data.

The information captured by our 3D human pose model is embedded in the CNN architecture as an additional layer that lifts 2D landmark coordinates into 3D while imposing that they lie on the space of physically plausible poses. The advantage of integrating the output proposed by the 2D landmark location predictors – based purely on image appearance – with the 3D pose predicted by a probabilistic model is that the 2D landmark location estimates are improved by guaranteeing that they satisfy the anatomical 3D constraints encapsulated in the human 3D pose model. In this way, both tasks clearly benefit from each other.

A further advantage of our approach is that our 2D and 3D training data sources may be completely independent. Our deep architecture only needs images annotated with 2D poses, not 3D poses. The human pose model is trained independently and exclusively from 3D mocap data. This decoupling between 2D and 3D training data presents a huge advantage, since we can augment our training sets completely independently. For instance, we can take advantage of extra 2D pose annotations without the need for 3D ground truth, or extend the 3D training data to further mocap datasets without the need for synchronized 2D images.

Our contribution: In this work, we show how to integrate a pre-learned 3D human pose model directly within a novel CNN architecture (illustrated in figure 1) for joint 2D landmark and 3D human pose estimation. In contrast to pre-existing methods, we do not take a pipeline approach that takes 2D landmarks as given. Instead, we show how such a model can be used as part of the CNN architecture itself, and how the architecture can learn to use physically plausible 3D reconstructions in its search for better 2D landmark locations. Our method achieves state-of-the-art results on the Human3.6M dataset both in terms of 2D and 3D errors.

Figure 1: Our multistage deep architecture for 2D/3D human pose estimation. Each stage t produces as output a set of belief maps for the location of the 2D landmarks (one per landmark). The belief maps from each stage t, combined with learned image features, are used as input to the next stage t+1 (red arrows show information flowing from one stage to the next). Internally, each stage learns to combine: (a) appearance-based belief maps provided by convolutional 2D landmark predictors, with (b) projected pose belief maps, proposed by our new probabilistic 3D pose model that encodes 3D structural information. The 3D pose layer is responsible for lifting 2D landmark coordinates into 3D and projecting them onto the space of valid 3D poses. These two belief maps are then fused into a single set of output proposals for the 2D landmark locations per stage. The accuracy of the 2D and 3D landmark locations increases progressively through the stages. The overall architecture is trained end-to-end using back-propagation. [Best viewed in color.]

2. Related Work

We first describe methods that assume that 2D joint locations are provided as input and focus on solving the 3D lifting problem, and follow with methods that learn to estimate the 3D pose directly from images.
3D pose from known 2D joint positions: A large body of work has focused on recovering the 3D pose of people given perfect 2D joint positions as input. Early approaches [18, 33, 24, 6] took advantage of anatomical knowledge of the human skeleton or joint angle limits to recover pose from a single image. More recent methods [13, 27, 3] have focused on learning a prior statistical model of the human body directly from 3D mocap data.

Non-rigid structure from motion (NRSfM) approaches also recover 3D articulated motion [8, 4, 14, 19] given known 2D correspondences for the joints in every frame of a monocular video. Their huge advantage, as unsupervised methods, is that they do not need 3D training data, instead learning a linear basis for the 3D poses purely from 2D data. Their main drawback is their need for significant camera movement throughout the sequence to guarantee accurate 3D reconstruction. Recent work on NRSfM applied to human pose estimation has focused on escaping these limitations by the use of a linear model to represent shape variations of the human body. For instance, [10] defined a generative model based on the assumption that complex shape variations can be decomposed into a mixture of primitive shape variations, and achieve competitive results.

Representing human 3D pose as a linear combination of a sparse set of 3D bases, pre-trained using 3D mocap data, has also proved a popular approach for articulated human motion [27, 42, 47]. While [47] propose a convex relaxation to jointly estimate the coefficients of the sparse representation and the camera viewpoint, [27] and [42] enforce limb length constraints. Although these approaches can reconstruct 3D pose from a single image, their best results come from imposing temporal smoothness on the reconstructions of a video sequence.

Recently, Zhao et al. [45] achieved state-of-the-art results by training a simple neural network to recover 3D pose from known 2D joint positions. While the results on perfect 2D input data are impressive, the inaccuracies in 2D joint estimation are not modeled, and the performance of this approach combined with joint detectors is unknown.

3D pose from images: Most approaches to 3D pose inference directly from images fall into one of two categories: (i) models that learn to regress the 3D pose directly from image features, and (ii) pipeline approaches where the 2D pose is first estimated, typically using discriminatively trained part models or joint predictors, and then lifted into 3D. While regression-based methods suffer from the need to annotate all images with ground-truth 3D poses – a technically complex and elaborate process – for pipeline approaches the challenge is how to account for uncertainty in the measurements. Crucial to both types of approaches is the question of how to incorporate the 3D dependencies between the different body joints, or to leverage other useful 3D geometric information in the inference process.

Many earlier works on human pose estimation from a single image relied on discriminatively trained models to learn a direct mapping from image features such as silhouettes, HOG or SIFT to 3D human poses, without passing through 2D landmark estimation [1, 12, 11, 23, 31].

More recent direct approaches make use of deep learning [20, 21, 39, 40]. Regression-based approaches train an end-to-end network to predict 3D joint locations directly from the image [40, 20, 21, 46]. While [21] incorporate model joint dependencies in the CNN via a max-margin formalism, others [46] impose kinematic constraints by embedding a differentiable kinematic model into the deep learning architecture. More recently, Tekin et al. [34] propose a deep regression architecture for structured prediction that combines traditional CNNs for supervised learning with an auto-encoder that implicitly encodes 3D dependencies between body parts.

As CNNs have become more prevalent, 2D joint estimation [43] has become increasingly reliable, and many recent works have looked to exploit this using a pipeline approach. Papers such as [9, 16, 39, 25] first estimate 2D landmarks, and later 3D spatial relationships are imposed between them using structured learning or graphical models.

Simo-Serra et al. [32] were one of the first to propose an approach that naturally copes with the noisy detections inherent to off-the-shelf body part detectors by modeling their uncertainty and propagating it through 3D shape space while guaranteeing that geometric and kinematic 3D constraints are satisfied. The work [30] also estimates the location of 2D joints before predicting 3D pose, using appearance and the probable 3D pose of discovered parts in a hierarchical Bayesian non-parametric model. Another recent example is Bogo et al. [7], who fit a detailed statistical 3D body model [22] to 2D joint proposals.

Zhou et al. [48] tackle the problem of 3D pose estimation for a monocular image sequence, integrating 2D, 3D and temporal information to account for uncertainties in the model and the measurements. Similar to our proposed approach, Zhou et al.'s method [48] does not need synchronized 2D-3D training data, i.e. it only needs 2D pose annotations to train the CNN joint regressor and a separate 3D mocap dataset to learn the 3D sparse basis. Unlike our approach, it relies on temporal smoothness for its best performance, and performs poorly on a single image.

Finally, Wu et al. [?]'s 3D Interpreter Network, a recent approach to estimate the skeletal structure of common objects (chairs, sofas, ...), bears similarities with our method. While our approaches share common ground in the decoupling of 3D and 2D training data and the use of projection from 3D to improve 2D predictions, the network architectures are very different and, unlike us, they do not carry out a quantitative evaluation on the problem of 3D human pose estimation.

3. Our Architecture

Figure 1 illustrates the main contribution of our approach, a new multi-stage CNN architecture that can be trained end-to-end to estimate jointly 2D and 3D joint locations. Crucially, it includes a novel layer, based on our probabilistic 3D model of human pose, responsible for lifting 2D poses into 3D and propagating 3D information about the skeletal structure to the 2D convolutional layers. In this way, the prediction of 2D pose benefits from the 3D information encoded. Section 4 describes our new probabilistic 3D model of human pose, trained on a dataset of 3D mocap data. Section 5 describes all the new components and layers of our CNN architecture. Finally, Section 6 describes our experimental evaluation on the Human3.6M dataset, where we obtain state-of-the-art results. In addition, we show qualitative results on images from the MPII dataset.

4. Our Probabilistic 3D Model of Human Pose

One fundamental challenge in creating models of human poses lies in the lack of access to 3D data of sufficient variety to characterize the space of human poses. To compensate for this lack of data we identify and eliminate confounding factors such as rotation in the ground plane, limb length, and left-right symmetry that lead to conceptually similar poses being unrecognized in the training data.

We eliminate some factors by simple pre-processing. Variance due to limb length is addressed by normalizing the data such that the sum of squared limb lengths on the human skeleton is one, while left-right symmetry is exploited by flipping each pose in the x-axis and re-annotating left as right and vice versa.

4.1. Aligning 3D Human Poses in the Training Set

Allowing for rotational invariance in the ground plane is more challenging and requires integration with our data model. We seek the optimal rotations for each pose such that, after rotating the poses, they are closely approximated by a low-rank compact Gaussian distribution.

We formulate this as a problem of optimization over a set of variables. Given a set of N training 3D poses, each represented as a (3×L) matrix P_i of 3D landmark locations, where i ∈ {1, 2, .., N} and L is the number of human joints/landmarks, we seek global estimates of an average 3D pose µ, a set of J orthonormal basis matrices¹ e and noise variances σ, alongside per-sample rotations R_i and basis coefficients a_i, such that the following estimate is minimized:

    \operatorname*{arg\,min}_{R,\mu,a,e,\sigma} \sum_{i=1}^{N} \Big( \| P_i - R_i(\mu + a_i \cdot e) \|_2^2 + \sum_{j=1}^{J} (a_{i,j}\,\sigma_j)^2 + \ln \sum_{j=1}^{J} \sigma_j^2 \Big)    (1)

where a_i · e = Σ_j a_{i,j} e_j is the tensor analog of a multiplication between a vector and a matrix, and ||·||²₂ is the squared Frobenius norm of the matrix. Here the y-axis is assumed to point up, and the rotation matrices R_i considered are ground-plane rotations of the form

    R_i = \begin{pmatrix} \cos\theta_i & 0 & \sin\theta_i \\ 0 & 1 & 0 \\ -\sin\theta_i & 0 & \cos\theta_i \end{pmatrix}    (2)

¹When we say e is a set of orthonormal basis matrices, we mean that each matrix, if unwrapped into a vector, is of unit norm and orthogonal to all other unwrapped matrices.

With the large number of 3D pose samples considered (of the order of 1 million when training on the Human3.6M dataset [15]), and the complex inter-dependencies between samples for e and σ, the memory requirements mean that it is not possible to solve directly as a joint optimization over all variables using a non-linear solver such as Ceres [2]. Instead, we carefully initialize and then alternate between performing closed-form PPCA [37] to update estimates of µ, a, e, σ, and updating the rotations R_i using Ceres [2] to minimize the above error. As we do this, we steadily increase the size of the basis from 1 through to its target size J. This stops apparent deformations that could be resolved through rotations from becoming locked into the basis at an early stage, and empirically leads to lower-cost solutions.

Figure 2: Visualization of the 3D training data after alignment (see section 4.1) using 2D PCA. Notice how all poses have the same orientation. Standing-up poses a), b), c) and d) are all close to each other and far from sitting-down poses f) and h), which form another clear cluster.

To initialize, we use a variant of the Tomasi-Kanade [38] algorithm to estimate the mean 3D pose µ.
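The restricted rotation in (2) leaves the y coordinates of a pose untouched, a fact the initialization relies on. A minimal NumPy sketch of this rotation (the pose values below are illustrative, not data from the paper):

```python
import numpy as np

def ground_plane_rotation(theta):
    """The restricted rotation R_i of Eq. (2): a rotation about the y-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

# A toy (3 x L) pose matrix with L = 4 landmarks (illustrative values only).
P = np.array([[0.10, -0.20, 0.30, 0.00],    # x
              [1.00,  1.50, 0.80, 0.20],    # y (the axis that points up)
              [0.05,  0.00, -0.10, 0.20]])  # z

R = ground_plane_rotation(0.7)
P_rot = R @ P

assert np.allclose(R.T @ R, np.eye(3))  # R is orthonormal
assert np.allclose(P_rot[1], P[1])      # the y row is untouched
```

Because the y row is invariant under any such rotation, the y component of µ can be estimated directly from the data before any rotations are solved for.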
As the y component is not altered by planar rotations, we take as our estimate of the y component of µ the mean of each point in the y direction. For the x and z components, we interleave the x and z components of each sample and concatenate them into a large 2N×L matrix M, and find the rank-two approximation of this such that M ≈ A·B. We then calculate Â by replacing each adjacent pair of rows of A with the closest orthonormal matrix of rank two, and take Â†M as our estimate² of the x and z components of µ.

²Â† being the pseudo-inverse of Â.

The end result of this optimization is a compact low-rank approximation of the data in which all reconstructed poses appear to have the same orientation (see figure 2). In the next section we extend our model to be described as a multi-modal distribution, to better capture the variations in the space of 3D human poses.

4.2. A Multi-Modal Model of 3D Human Pose

Although it is possible to directly use the Gaussian model learned in the previous section 4.1 to estimate the 3D pose (see results in section 6), inspection of figure 2 shows that the data is not Gaussian distributed and is better described using a multi-modal distribution. In doing this, we are heavily inspired both by approaches such as [26], which characterize the space of human poses as a mixture of PCA bases, and by related works such as [41, 8] that represent poses as an interpolation between exemplars. These approaches are extremely good at modeling tightly distributed poses (e.g. walking) where samples in the testing data are likely to be close to poses seen in training. This is emphatically not the case in much of the Human3.6M dataset, which we use for evaluation. Zooming in on the edges of Figure 2 reveals many isolated paths where motions occur once and are never revisited again.

Nonetheless, it is precisely these regions of low density that we are interested in modeling. As such, we seek a coarse representation of the pose space that says something about the regions of low density but also characterizes the multi-modal nature of the pose space. We choose to represent our data as a mixture of probabilistic PCA models using few clusters, trained using the EM algorithm [37]. When using a small number of clusters, it is important to initialize the algorithm correctly, as accidentally initializing with multiple clusters about a single mode can lead to poor density estimates. To initialize our clusters we make use of a simple heuristic.

We first sub-sample the aligned poses (which we refer to as P), and then compute the Euclidean distance d among pairs. We seek a set of k samples S such that the distance between points and their nearest sample is minimized:

    \operatorname*{arg\,min}_{S} \sum_{p \in P} \min_{s \in S} d(s, p)    (3)

We find S using greedy selection, holding our previous estimate of S constant and iteratively selecting the next candidate such that {s} ∪ S minimizes the above cost. A selection of 3D pose samples found using this procedure can be seen in the rendered poses of Figure 2. In practice, we stop proposing candidates when they occur too close to the existing candidates, as shown by samples (a–d), and only choose one candidate from the dominant mode.

Given these candidates for cluster centers, we assign each aligned point to a cluster representing its nearest candidate and then run the EM algorithm of [37], building a mixture of probabilistic PCA bases.
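The greedy selection for (3) can be sketched as follows. This is our own NumPy illustration rather than the authors' code; the O(N²) pairwise-distance matrix is viable because the poses have already been sub-sampled:

```python
import numpy as np

def greedy_k_samples(poses, k):
    """Greedily build S for Eq. (3): at each step, holding S fixed, add the
    sample s for which S u {s} minimizes  sum_p min_{s in S} d(s, p).
    poses: (N, D) array of flattened, aligned 3D poses."""
    dists = np.linalg.norm(poses[:, None, :] - poses[None, :, :], axis=2)
    S = [int(np.argmin(dists.sum(axis=1)))]   # best single-sample start
    d_near = dists[S[0]].copy()               # distance to nearest chosen sample
    while len(S) < k:
        # cost of S u {s} for every candidate s
        costs = np.minimum(d_near[None, :], dists).sum(axis=1)
        s = int(np.argmin(costs))
        S.append(s)
        d_near = np.minimum(d_near, dists[s])
    return S

# Two well-separated toy clusters: one sample should be picked from each.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.0, 10.1]])
picked = greedy_k_samples(toy, 2)
```

A stopping rule like the paper's (halt when new candidates land too close to existing ones) could be added by thresholding `d_near[s]` before accepting a candidate.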
5. A New Convolutional Architecture for 2D and 3D Pose Inference

Our 3D pose inference from a single RGB image makes use of a multistage deep convolutional architecture, trained end-to-end, that repeatedly fuses and refines 2D and 3D poses, and a second module which takes the final predicted 2D landmarks and lifts them one last time into 3D space for our final estimate (see Figure 1).

At its heart, our architecture is a novel refinement of the Convolutional Pose Machine of Wei et al. [43], who reasoned exclusively in 2D and proposed an architecture that iteratively refined 2D pose estimations of landmarks using a mixture of knowledge of the image and of the estimates of landmark locations of the previous stage. We modify this architecture by generating, at each stage, projected 3D pose belief maps which are fused in a learned manner with the standard maps. From an implementation point of view, this is done by introducing two distinct layers, the probabilistic 3D pose layer and the fusion layer (see figure 1).

Figure 3 shows how the 2D uncertainty in the belief maps is reduced at each stage of the architecture and how the accuracy of the 3D poses increases throughout the stages.

5.1. Architecture of each stage

Our entire architecture consists of 6 stages. Each stage in our sequential architecture is made out of 4 clear components (see figure 1):

Predicting CNN-based belief maps: we use a set of convolutional and pooling layers, equivalent to those used in the original CPM architecture [43], that combine evidence obtained from learned image features with the belief maps obtained from the previous stage (t−1) to predict an updated set of belief maps for the 2D human joint positions. This step is described in section 5.2.

Lifting 2D belief maps into 3D: the output of the CNN-based belief maps is then input to a new layer that uses our new pre-trained probabilistic 3D human pose model to lift the proposed 2D poses into 3D. The process of lifting 2D into 3D is described in section 5.3.

Projected 2D pose belief maps: the 3D pose estimated by our 3D pose inference layer is then projected back onto the image plane to produce a new set of projected pose belief maps. These belief maps encapsulate 3D dependencies between the body parts. We describe the projection from 3D to 2D in section 5.4.

2D Fusion layer: the final layer in each stage (described in section 5.5) learns the weights to fuse the two sets of belief maps into a unique set, which is then input into the next stage t+1 of the architecture.

Our novel layers were implemented as an extension of the published code of Convolutional Pose Machines [43] inside the caffe framework as python layers, with all weights updated using Stochastic Gradient Descent with momentum.

5.2. Predicting CNN-based belief maps

Convolutional Pose Machines [43] can be understood as an updating of the earlier work of Ramakrishna et al. [28] to use a deep convolutional architecture. In both approaches, at each stage t and for each landmark p, the algorithm returns dense per-pixel belief maps b^p_t[u,v], which show how confident it is that a joint center or landmark occurs in any given pixel (u,v). See Figure 4a for a visualization of stage-1 maps. For stages t ∈ {2,...,T}, the belief maps are a function of not just the information contained in the image but also the information computed by the previous stage.

In the case of convolutional pose machines, and in our work which uses the same architecture, a summary of the convolution widths and architecture design is shown in Figure 1, with more details of training given in Convolutional Pose Machines (CPM) [43].

Both approaches [43, 28] predict the locations of different landmarks to those captured in the Human3.6M dataset. As such, the input and output layers in each stage of the architecture are replaced with a larger set to account for the greater number of joints. The new architecture is then initialized using the weights found in CPM's model for all pre-existing layers, with the new layers randomly initialized.

After retraining, convolutional pose machines return per-pixel estimates of landmark locations, while our techniques for 3D estimation (described in the next section) make use of 2D locations. To transform these belief maps into locations, we select the most confident pixel as the location of each landmark:

    Y_p = \operatorname*{arg\,max}_{(u,v)} b_p[u,v]    (4)

Figure 3: Results returned by different stages of the architecture (per-stage errors: 9.19, 7.30, 6.64, 3.34, 3.28 and 3.10 mm). Top: evolution of the 2D skeleton after projecting the 3D points back into the 2D space; Center: evolution of the beliefs for the landmark Left hand through the stages; Bottom: 3D skeleton with the relative mean error per landmark in millimeters. Even when the landmark locations are incorrect, the model returns a physically plausible solution.

Figure 4: Evolution of the beliefs in stage 1 of the architecture for landmark Spine. Beliefs are fused together by the fusion layer: a) initial convolutional belief b^p; b) projected pose estimation b̂^p; c) fused information F^p. See section 5.4.

Figure 5: Landmark refinement. Left: 2D predicted landmark positions; Right: improved predictions using the projected 3D pose.

5.3. Lifting 2D belief maps into 3D

We follow [48] in assuming a weak perspective model, and first describe the simplest case of estimating the 3D pose of a single frame using a uni-modal Gaussian 3D pose model as described in section 4. This model is composed of a mean shape µ, a set of basis matrices e and variances σ², and from this we can compute the most probable sample from the model that could give rise to a projected image:

    \operatorname*{arg\,min}_{R,a} \| Y - s\,\Pi E R(\mu + a \cdot e) \|_2^2 + \| \sigma \cdot a \|_2^2    (5)

where Π is the canonical orthographic projection matrix, E a known external camera calibration matrix, and s the estimated per-frame scale. Although, given R, this problem is convex in a and s together³, for an unknown rotation matrix R the problem is extremely non-convex – even if a is known – and prone to sticking in local minima using first- or second-order gradient descent. The local optima often lie far apart in pose space, and getting stuck in a poor optimum leads to significantly worse 3D reconstructions.

³To see this, consider the trivial reparameterization where we solve for sµ + b·e and then let a = b/s.
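Because R has the restricted single-angle form of (2), the minimization can be attacked by sweeping quantized angles and, for each fixed rotation, solving a linear least-squares problem via the reparameterization b = s·a of footnote 3. The sketch below is our own simplification: it uses a toy ΠE that simply keeps the x and y rows, and drops the prior term ||σ·a||² for brevity:

```python
import numpy as np

def rot_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def lift_quantized(Y, mu, e, n_angles=80):
    """Minimize ||Y - s*PiER(mu + a.e)||^2 over quantized ground-plane
    angles. For fixed R the problem is linear in (s, b) with b = s*a.
    Y: (2, L) 2D landmarks; mu: (3, L) mean pose; e: (J, 3, L) basis.
    Returns (theta, s, a) of the best found solution."""
    J = e.shape[0]
    y = Y.ravel()
    best_resid, best = np.inf, None
    for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        R = rot_y(theta)
        # Columns of the linear system: projected mean, projected bases.
        cols = [(R @ mu)[:2].ravel()]
        cols += [(R @ e[j])[:2].ravel() for j in range(J)]
        A = np.stack(cols, axis=1)                      # (2L, 1 + J)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = np.linalg.norm(A @ coef - y)
        if resid < best_resid:
            s, b = coef[0], coef[1:]
            best_resid, best = resid, (theta, s, b / s)  # recover a = b / s
    return best
```

With the true angle on or near the grid, the recovered parameters reproject essentially onto the observed landmarks; the paper reports that checking 80 quantized angles already matches the accuracy of globally optimal solutions in practice.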
We take advantage of the matrix R's restricted form, which allows it to be parameterized in terms of a single angle θ (see (2)). Rather than attempting to solve this optimization problem using local gradient-based methods, we quantize over the space of possible rotations and, for each choice of rotation, hold it fixed and solve for s and a, before picking the minimum-cost solution over all choices of R. With fixed choices of rotation, the terms ΠERµ and ΠERe can be computed in advance, and finding the optimal a becomes a simple linear least-squares problem.

This process is highly efficient, and by oversampling the rotations and exhaustively checking 10,000 locations we can guarantee that a solution extremely close to the global optimum is found. In practice, using 20 samples and refining the rotations and basis coefficients of the best found solution using a non-linear least-squares solver obtains the same reconstruction, and we make use of the faster option of checking 80 locations and using the best found solution as our 3D estimate. This puts us close to the global optimum and has the same average accuracy as finding the global optimum. Moreover, it allows us to upgrade from sparse landmark locations to 3D using a single Gaussian at around 3,000 frames a second using python code on a standard laptop.

To handle models consisting of a mixture of Gaussians, we follow [26] and simply solve for each Gaussian independently and select the most probable solution.

5.4. Projecting 3D poses onto 2D belief maps

Our projected pose model is interleaved throughout the architecture (see Figure 1). The goal is to correct the beliefs regarding landmark locations at each stage by fusing extra information about 3D physical plausibility. Given the solution R, s, and a from the previous component, we estimate a physically plausible projected 3D pose as

    \hat{Y} = s\,\Pi E R(\mu + a \cdot e)    (6)

which is then embedded in a belief map as

    \hat{b}^p_{i,j} = \begin{cases} 1 & \text{if } (i,j) = \hat{Y}_p \\ 0 & \text{otherwise} \end{cases}    (7)

and then convolved using Gaussian filters.

5.5. 2D Fusion of belief maps

The 2D belief maps predicted by our probabilistic 3D pose model are fused with the CNN-based belief maps b^p according to the following equation:

    f^p_t = w_t * b^p_t + (1 - w_t) * \hat{b}^p_t    (8)

where w_t ∈ [0,1] is a weight trained as part of the end-to-end learning. This set of fused belief maps f_t is then passed to the next stage and used as an input to guide the 2D re-estimation of joint locations, instead of the belief maps b_t used by convolutional pose machines.

Figure 6: Results from the Human3.6M dataset. The identified 2D landmark positions and 3D skeleton are shown for each pose, taken from different actions: Walking, Phoning, Greeting, Discussion, Sitting Down.
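Equations (7) and (8) amount to placing a Gaussian-smoothed peak at each projected landmark and taking a learned convex combination with the CNN map. A small NumPy sketch (the map size and Gaussian width σ are hypothetical choices of ours):

```python
import numpy as np

def projected_belief_map(y_hat, shape, sigma=1.5):
    """Eq. (7) plus the Gaussian smoothing: a unit peak at the projected
    landmark location, written directly as a Gaussian bump (equivalent to
    convolving the one-hot map with a Gaussian filter)."""
    i0, j0 = int(round(y_hat[0])), int(round(y_hat[1]))
    ii, jj = np.mgrid[0:shape[0], 0:shape[1]]
    b = np.exp(-((ii - i0) ** 2 + (jj - j0) ** 2) / (2.0 * sigma ** 2))
    return b / b.max()

def fuse_belief_maps(b_cnn, b_proj, w):
    """Eq. (8): f = w * b + (1 - w) * b_hat, with w in [0, 1] learned
    end-to-end in the paper (here simply passed in)."""
    assert 0.0 <= w <= 1.0
    return w * b_cnn + (1.0 - w) * b_proj

b_hat = projected_belief_map((10, 20), (46, 46))
fused = fuse_belief_maps(projected_belief_map((12, 20), (46, 46)), b_hat, 0.5)
```

Reading landmark coordinates back out of a fused map is then the per-landmark argmax of Eq. (4), e.g. `np.unravel_index(np.argmax(fused), fused.shape)`.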
Final lifting: The belief maps produced as the output of the final stage (t = 6) are then lifted into 3D to give the final estimate for the pose (see figure 1), using our algorithm for lifting 2D poses into 3D.

6. Experimental evaluation

Human3.6M dataset: Our model was trained and tested on the Human3.6M dataset, consisting of 3.6 million accurate 3D human poses [15]. This is a video and mocap dataset of 5 female and 6 male subjects, captured from 4 different viewpoints, that shows them performing typical activities (talking on the phone, walking, greeting, eating, etc.).

2D Evaluation: Figure 5 shows how the 2D predictions are improved by our projected pose model, reducing the overall mean error per landmark. The 2D error reduction using our full approach over the estimates of [43] is comparable in magnitude to the improvement due to the change of architecture moving from the work of Zhou et al. [48] to the state-of-the-art 2D architecture [43] (i.e. a reduction of 0.59 pixels vs. 0.81 pixels). See Table 4 for details.

3D Evaluation: Several evaluation protocols have been followed by different authors to measure the performance of their 3D pose estimation methods on the Human3.6M dataset. Tables 1, 2 and 3 show comparisons of the performance of our 3D pose estimation with previously published works, where we take care to evaluate using the appropriate protocol in each case.

Protocol 1, the most standard evaluation protocol on Human3.6M, was followed by [15, 21, 36, 34, 35, 48, 30]. The training set consists of 5 subjects (S1, S5, S6, S7, S8), while the test set includes 2 subjects (S9, S11). The original frame rate of 50 fps is downsampled to 10 fps, and the evaluation is on sequences coming from all 4 cameras and all trials. The reported error metric is the 3D error, which corresponds to the average Euclidean distance of the estimated 3D joints to the ground truth. The error is averaged over all 17 joints of the Human3.6M skeletal model.

                           Directions  Discussion  Eating  Greeting  Phoning   Photo  Posing  Purchases
LinKDE [15]                    132.71      183.55  132.37    164.39   162.12  205.94  150.61     171.31
Li et al. [21]                      -      136.88   96.94    124.74        -  168.68       -          -
Tekin et al. [36]              102.39      158.52   87.95    126.83   118.37  185.02  114.69     107.61
Tekin et al. [34]                   -      129.06   91.43    121.68        -  162.17       -          -
Tekin et al. [35]               85.03      108.79   84.38     98.94   119.39   95.65   98.49      93.77
Zhou et al. [48]                87.36      109.31   87.05    103.16   116.18  143.32  106.88      99.78
Sanzari et al. [30]             48.82       56.31   95.98     84.78    96.47  105.58   66.30     107.41
Ours - Single PPCA Model        68.55       78.27   77.22     89.05    91.63  110.05   74.92      83.71
Ours - Mixture PPCA Model       64.98       73.47   76.82     86.43    86.28  110.67   68.93      74.79

                           Sitting  SittingDown  Smoking  Waiting  WalkDog  Walking  WalkTogether  Average
LinKDE [15]                 151.57       243.03   162.14   170.69   177.13    96.60        127.88   162.14
Li et al. [21]                   -            -        -        -   132.17    69.97             -        -
Tekin et al. [36]           136.15       205.65   118.21   146.66   128.11    65.86         77.21   125.28
Tekin et al. [34]                -            -        -        -   130.53    65.75             -        -
Tekin et al. [35]            73.76       170.40    85.08   116.91   113.72    62.08         94.83   100.08
Zhou et al. [48]            124.52       199.23   107.42   118.09   114.23    79.39         97.70   113.01
Sanzari et al. [30]         116.89       129.63    97.84    65.94   130.46    92.58        102.21    93.15
Ours - Single PPCA Model    115.94       185.72    88.25    88.73    92.37    76.48         77.95    92.96
Ours - Mixture PPCA Model   110.19       173.91    84.95    85.78    86.26    71.36         73.14    88.39

Table 1: A comparison of our approach on the Human3.6M dataset against competitors that follow Protocol 1 for evaluation. We substantially outperform all other methods in terms of average error, showing a 4.7mm average improvement over our closest competitor. Note that some approaches [36, 48] use video as input instead of a single frame.

                           Average error
Yasin et al. [44]                  108.3
Rogez et al. [29]                   88.1
Ours - Mixture PPCA Model           70.7

Table 2: A comparison of our approach on the Human3.6M dataset against competitors that follow Protocol 2 for evaluation.

                           Average error
Bogo et al. [7]                     82.3
Ours - Mixture PPCA Model           79.6

Table 3: A comparison of our approach on the Human3.6M dataset against Bogo et al. [7], who follow Protocol 3 described in Section 6. The comparison is done using the same set of joints as Bogo et al. [7] (14 rather than 17).
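Protocol 1's 3D error is a plain mean per-joint Euclidean distance over the 17 joints; as a minimal sketch (the array contents are hypothetical):

```python
import numpy as np

def protocol1_error(pred, gt):
    """Protocol 1 '3D error': average Euclidean distance of the estimated
    3D joints to the ground truth, over the 17 Human3.6M joints.
    pred, gt: (17, 3) arrays in millimetres."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# Every joint displaced by a (3, 0, 4) mm offset -> an error of exactly 5 mm.
gt = np.zeros((17, 3))
pred = gt + np.array([3.0, 0.0, 4.0])
```

Protocol 2's metric differs only in first aligning each predicted pose to the ground truth with a per-frame Procrustes fit and averaging over 14 joints.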
Table 1 shows a comparison between our approach and other competing approaches that use Protocol 1 for evaluation. Our baseline method, using a single unimodal probabilistic PCA model, outperforms almost every method in most action types, with the exception of Sanzari et al. [30], which it still outperforms on average across the entire dataset. Our mixture model improves on this again, offering a 4.76mm improvement over Sanzari et al., our closest competitor.

                                 2D pixel error
Zhou et al. [48]                 10.85
Trained CPM [43] architecture    10.04
Ours, including 3D refinement     9.47

Table 4: 2D errors of related work. Our approach of lifting the 2D landmark predictions into a plausible 3D model and then projecting back into the image substantially reduces the error. Note that [48] uses video as input and knowledge of the action label.

Protocol 2, followed by [44, 29], selects 6 subjects (S1, S5, S6, S7, S8 and S9) for training and subject S11 for testing. The original video is downsampled to every 64th frame and evaluation is performed on sequences from all 4 cameras and all trials. The error metric reported in this case is the 3D pose error, equivalent to the per-joint 3D error up to a similarity transformation (i.e. each estimated 3D pose is aligned with the ground-truth pose, on a per-frame basis, using Procrustes analysis). The error is averaged over a subset of 14 joints. Table 2 shows a comparison between our approach and other competing approaches that use Protocol 2 for evaluation.

Although in this specific case our model had been trained using only the 5 subjects used for training in Protocol 1 (one fewer subject), it still outperforms both other methods [44, 29].

Protocol 3, followed by [7], selects the same subjects for training and testing as Protocol 1. However, evaluation is performed only on sequences captured from the frontal camera (cam 3) from trial 1, and the original video is not sub-sampled. The error metric used in this case is the 3D pose error as described in Protocol 2. The error is averaged over 14 joints. Table 3 shows a comparison between our approach and [7].

Our method outperforms Bogo et al. [7] by almost 3mm on average. It is important to remind the reader that, unlike our approach, Bogo et al. [7] exploit a high-quality, detailed statistical 3D body model [22], trained on thousands of 3D body scans, that captures both the variation of human body shape and its deformation through pose.

MPII and LEEDS datasets: Our proposed approach, trained exclusively on the Human3.6M dataset, can be used to identify 2D and 3D landmarks of images contained in different datasets. Figure 7 shows some qualitative results on the MPII dataset [5], whereas Figure 8 shows qualitative results on the LEEDS dataset [17], including failure cases. Notice how our probabilistic 3D pose model generates anatomically plausible poses even though the 2D landmark estimations are not all correct. However, as shown in the bottom row, even small errors in 2D pose can lead to drastically different 3D poses. These inaccuracies could be mitigated, without further 3D data, by annotating additional RGB images for training from different datasets.

Figure 7: Results on images from the MPII dataset [5], showing success cases and failure cases. Our model was not trained on images as diverse as those contained in this dataset; however, it often retrieves correct 2D and 3D joint positions. The last row shows example cases where the method fails either in the identification of 2D or 3D landmarks.

7. Conclusion

We have presented a novel approach to human 3D pose estimation from a single image that outperforms previous solutions. To our knowledge, we are the first to approach this as a problem of iterative refinement in which 3D proposals help refine and improve upon the 2D estimates. Our approach shows the importance of thinking in 3D even for 2D pose estimation within a single image, with our iterative 3D model demonstrating better 2D accuracy than Convolutional Pose Machines [43], the iterative 2D approach it is based upon.

Our novel approach for upgrading from 2D to 3D is extremely efficient, and runs in CPU-based Python at around 1,000 frames a second, while a GPU-based real-time approach for Convolutional Pose Machines has been announced. Integrating these two systems to provide a reliable real-time 3D pose estimator seems like a natural future direction, as does integrating our approach with a simpler 2D approach for real-time pose estimation on lower-power devices.

References

[1] A. Agarwal and B. Triggs. Recovering 3d human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):44–58, 2006.
[2] S. Agarwal, K. Mierle, and Others. Ceres solver. http://ceres-solver.org.
[3] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1446–1455, 2015.
[4] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Trajectory space: A dual representation for nonrigid structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[5] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. Human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. Barrón and I. A. Kakadiaris. Estimating anthropometry and pose from a single uncalibrated image. Computer Vision and Image Understanding, 81(3):269–284, 2001.
[7] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
[8] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 690–696. IEEE, 2000.
[9] X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems, pages 1736–1744, 2014.
[10] J. Cho, M. Lee, and S. Oh. Complex non-rigid 3d shape recovery using a procrustean normal distribution mixture model. International Journal of Computer Vision, 117(3):226–246, 2016.
[11] C. H. Ek, P. H. S. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In A. Popescu-Belis, S. Renals, and H. Bourlard, editors, MLMI, volume 4892 of Lecture Notes in Computer Science, pages 132–143. Springer, 2007.
[12] A. Elgammal and C. Lee. Inferring 3d body pose from silhouettes using activity manifold learning. In CVPR, 2004.
[13] X. Fan, K. Zheng, Y. Zhou, and S. Wang. Pose locality constrained representation for 3d human pose reconstruction. In European Conference on Computer Vision, pages 174–188. Springer, 2014.
[14] P. Gotardo and A. Martinez. Computing smooth time trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[15] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
[16] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. arXiv preprint arXiv:1312.7302, 2013.
