Space-Time Representation of People Based on 3D Skeletal Data: A Review

Fei Han∗, Brian Reily∗, William Hoff, and Hao Zhang

arXiv:1601.01006v2 [cs.CV] 21 Jan 2016

• F. Han∗, B. Reily∗, W. Hoff, and H. Zhang are with the Department of Electrical Engineering and Computer Science, Colorado School of Mines, Golden, CO, 80401. ∗These authors contributed equally to this work. E-mail: {fhan, breily, whoff, hzhang}@mines.edu

Abstract—Spatiotemporal human representation based on 3D visual perception data is a rapidly growing research area. Representations can be broadly categorized into two groups, depending on whether they use RGB-D information or 3D skeleton data. Recently, skeleton-based human representations have been intensively studied and keep attracting increasing attention, due to their robustness to variations of viewpoint, human body scale, and motion speed, as well as their realtime, online performance. This paper presents a comprehensive survey of existing space-time representations of people based on 3D skeletal data, and provides an informative categorization and analysis of these methods from the perspectives of information modality, representation encoding, structure and transition, and feature engineering. We also provide a brief overview of skeleton acquisition devices and construction methods, enlist a number of benchmark datasets with skeleton data, and discuss potential future research directions.

Index Terms—Human representation, skeleton data, 3D visual perception, space-time features, survey

1 INTRODUCTION

Human representation in spatiotemporal space is a fundamental research problem extensively investigated in computer vision and machine intelligence over the past few decades. The objective of building human representations is to extract compact, descriptive information (i.e., features) to encode and characterize a human's attributes from perception data (e.g., human shape, pose, and motion), when developing recognition or other human-centered reasoning systems. As an integral component of reasoning systems, approaches to construct human representations have been widely used in a variety of real-world applications, including video analysis [1], surveillance [2], robotics [3], human-machine interaction [4], augmented and virtual reality [5], assistive living [6], smart homes [7], education [8], and many others [9], [10], [11].

During recent years, human representations based on 3D perception data have been attracting an increasing amount of attention [12], [13], [14], [15]. Compared with 2D visual data, additional depth information provides several advantages. Depth images provide geometric information of pixels that encode the external surface of the scene in 3D space. Features extracted from depth images and 3D point clouds are robust to variations of illumination, scale, and rotation [16], [17]. Thanks to the emergence of affordable structured-light color-depth sensing technology, such as the Microsoft Kinect [18] and Asus Xtion PRO LIVE [19] RGB-D cameras, it is much easier and cheaper to obtain depth data. In addition, structured-light cameras enable us to retrieve the 3D human skeletal information in real time [20], which used to be possible only when using expensive and complex vision systems (e.g., motion capture systems [21]), thereby significantly popularizing skeleton-based human representations. Moreover, the vast increase in computational power allows researchers to develop advanced computational algorithms (e.g., deep learning [22]) to process visual data at an acceptable speed. These advancements contribute to the boom of utilizing 3D perception data to construct reasoning systems in the computer vision and machine learning communities.

Since the performance of machine learning and reasoning methods heavily relies on the design of data representation [23], human representations are intensively investigated to address human-centered research problems (e.g., human detection, tracking, pose estimation, and action recognition). Among a large number of human representation approaches [24], [25], [26], [27], [28], [29], most of the existing 3D-based methods can be broadly grouped into two categories: representations based on local features [30], [31] and skeleton-based representations [32], [33], [34]. Methods based on local features detect points of interest in space-time dimensions, describe the patches centered at the points as features, and encode them (e.g., using bag-of-words models) into representations, which can locate salient regions and are relatively robust to partial occlusion. However, methods based on local features ignore spatial relationships among the features. These approaches are often incapable of identifying feature affiliations, and thus are generally incapable of representing multiple individuals in the same scene. These methods are also computationally expensive because of the complexity of the procedures involved, including keypoint detection, feature description, dictionary construction, etc.

On the other hand, human representations based on 3D skeleton information provide a very promising alternative. The concept of skeleton-based representation can be traced back to the early seminal research of Johansson [35], which demonstrated that a small number of joint positions can effectively represent human behaviors. 3D skeleton-based representations also demonstrate promising performance in real-world applications including Kinect-based gaming, as well as in computer vision research [22], [36]. 3D skeleton-based representations are able to model the relationship of human joints and encode the whole body configuration.
They are also robust to scale and illumination changes, and can be invariant to camera view as well as human body rotation and motion speed. In addition, many skeleton-based representations can be computed at a high frame rate, which can significantly facilitate online, real-time applications. Given the advantages and previous success of 3D skeleton-based representations, we have witnessed a significant increase of new techniques to construct such representations in recent years, as demonstrated in Fig. 1, which underscores the need for this survey paper focusing on the review of 3D skeleton-based human representations.

Fig. 1. Number of 3D skeleton-based human representations published in recent years according to our comprehensive review. [Bar chart; y-axis: # papers; x-axis: years 2007 to 2014.]

Several survey papers were published in related research areas such as motion and activity recognition. For example, Han et al. [17] described the Kinect sensor and its general application in computer vision and machine intelligence. Aggarwal and Xia [16] recently published a review paper on human activity recognition from 3D visual data, which summarized five categories of representations based on 3D silhouettes, skeletal joints or body part locations, local spatio-temporal features, scene flow features, and local occupancy features. Several earlier surveys were also published to review methods to recognize human poses, motions, gestures, and activities [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], as well as their applications [49], [50]. However, none of the survey papers specifically focused on 3D human representation based on skeletal data, which was the subject of numerous research papers in the literature and continues to gain popularity in recent years.

The objective of this survey is to provide a comprehensive overview of 3D skeleton-based human representations published in the computer vision and machine intelligence communities. We categorize and compare the reviewed approaches from multiple perspectives, including information modality, representation coding, structure and transition, and feature engineering methodology, and analyze the pros and cons of each category. A comprehensive review of methods to acquire and estimate the 3D human skeleton and a complete list of available benchmark datasets are also included. Compared with the existing surveys, the main contributions of this review include:

• To the best of our knowledge, this is the first survey dedicated to human representations based on 3D skeleton data, which fills the current void in the literature.
• The survey is comprehensive and covers the most recent and advanced approaches. We review 170 3D skeleton-based human representations, including 149 papers that were published in the recent five years, thereby providing readers with the complete, state-of-the-art methods.
• This paper provides an insightful categorization and analysis of the 3D skeleton-based representation construction approaches from multiple perspectives, and summarizes and compares attributes of all reviewed representations.

The remainder of this review is structured as follows. Background information, including 3D skeleton acquisition and construction as well as public benchmark datasets, is presented in Section 2. Sections 3 to 6 discuss the categorization of 3D skeleton-based human representations from four perspectives: information modality in Section 3, encoding in Section 4, hierarchy and transition in Section 5, and feature construction methodology in Section 6. After discussing the advantages of skeleton-based representations and pointing out future research directions in Section 7, the review paper is concluded in Section 8.

2 BACKGROUND

The objective of building 3D skeleton-based human representations is to extract compact, discriminative descriptions to characterize a human's attributes from 3D human skeletal information. The 3D skeleton data encodes the human body as an articulated system of rigid segments connected by joints. This section discusses how 3D skeletal data can be acquired, including devices that directly provide the skeletal data and computational methods to construct the skeleton. Available benchmark datasets including 3D skeleton information are also summarized in this section.

2.1 Direct Acquisition of 3D Skeletal Data

Several commercial devices, including motion capture systems, time-of-flight sensors, and structured-light cameras, allow for direct retrieval of 3D skeleton data. The 3D skeletal kinematic human body models provided by these devices are illustrated in Fig. 2.

2.1.1 Motion Capture Systems (MoCap)

Motion capture systems identify and track markers that are attached to a human subject's joints or body parts to obtain 3D skeleton information. There are two main categories of MoCap systems, based on either visual cameras or inertial sensors. Optical systems employ multiple cameras positioned around a subject to track, in 3D space, reflective markers attached to the human body. In MoCap systems based on inertial sensors, each 3-axis inertial sensor estimates the rotation of a body part with respect to a fixed point. This information is collected to obtain the skeleton data without any optical devices around a subject. Software to collect skeleton data is provided with commercial MoCap systems, such as Nexus for Vicon MoCap¹ and the NatNet SDK for OptiTrack². MoCap systems, especially those based on multiple cameras, can provide very accurate 3D skeleton information at a very high speed. On the other hand, such systems are typically expensive and can only be used in well-controlled indoor environments.

1. Vicon: http://www.vicon.com/products/software/nexus.
2. OptiTrack: http://www.optitrack.com/products/natnet-sdk.

TABLE 1. Summary of Recent Skeleton Construction Techniques Based on Depth and/or RGB Images.
Reference | Approach | Input Data | Performance
Shotton et al. [20], [51] | Pixel-by-pixel classification | Single depth image | 3D skeleton, 16 joints, real-time, 200 fps
Ye et al. [52] | Motion exemplars | Single depth image | 3D skeleton, 38 mm accuracy
Jung et al. [53] | Random tree walks | Single depth image | 3D skeleton, real-time, 1000 fps
Sun et al. [54] | Conditional regression forests | Single depth image | 3D skeleton, over 80% average precision
Charles and Everingham [55] | Limb-based shape models | Single depth image | 2D skeleton, robust to occlusions
Holt et al. [56] | Decision tree poselets with pictorial structures prior | Single depth image | 3D skeleton, only needs small amount of training data
Grest et al. [57] | ICP using optimized Jacobian | Single depth image | 3D skeleton, over 10 fps
Baak et al. [58] | Matching previous joint positions | Single depth image | 3D skeleton, 20 joints, real-time, 100 fps, robust to sensor noise and occlusions
Taylor et al. [59] | Regression to predict correspondences | Single depth image and multiple silhouette images | 3D skeleton, 19 joints, real-time, 120 fps
Zhu et al. [60] | ICP on individual parts | Depth image sequence | 3D skeleton, 10 fps, robust to occlusion
Ganapathi et al. [61] | ICP with physical constraints | Depth image sequence | 3D skeleton, real-time, 125 fps, robust to self collision
Plagemann et al. [62], Ganapathi et al. [25] | Haar features and Bayesian prior | Depth image sequence | 3D skeleton, real-time
Zhang et al. [63] | 3D non-rigid matching based on MRF deformation model | Depth image sequence | 3D skeleton
Schwarz et al. [64] | Geodesic distance & optical flow | Depth and RGB image streams | 3D skeleton, 16 joints, robust to occlusions
Wang et al. [65] | Recurrent 2D/3D pose estimation | Single RGB images | 3D skeleton, robust to viewpoint changes and occlusions
Fan et al. [66] | Dual-source deep CNN | Single RGB images | 2D skeleton, robust to occlusions
Toshev and Szegedy [67] | Deep neural networks | Single RGB images | 2D skeleton, robust to appearance variations
Dong et al. [68] | Parselets / grid layout feature | Single RGB images | 2D skeleton, robust to occlusions
Akhter and Black [69] | Prior based on joint angle limits | Single RGB images | 3D skeleton
Tompson et al. [70] | CNN / Markov random field | Single RGB images | 2D skeleton, close to real-time
Elhayek et al. [71] | ConvNet joint detector | Multi-perspective RGB images | 2D skeleton, nearly 95% accuracy
Gall et al. [72], Liu et al. [73] | Skeleton tracking and surface estimation | Multi-perspective RGB images | 3D skeleton, deals with rapid movements and apparel like skirts

2.1.2 Structured-Light Cameras

Structured-light color-depth sensors are a type of camera that uses infrared light to capture depth information about a scene; examples include the Microsoft Kinect v1 [18], ASUS Xtion PRO LIVE [19], and PrimeSense [74] cameras. A structured-light sensor consists of an infrared-light source and a receiver that can detect infrared light. The light projector emits a known pattern, and the way that this pattern distorts on the scene allows the camera to decide the depth. A color camera is also available on the sensor to acquire color frames that can be registered to depth frames, thereby providing color-depth information at each pixel of a frame or 3D color point clouds. Several drivers are available to provide access to the color-depth data acquired by the sensor, including the Microsoft Kinect SDK [18], the OpenNI library [75], and the OpenKinect library [76]. The Kinect SDK also provides 3D human skeletal data using the method described by Shotton et al. [77]. OpenNI uses NITE [78], a skeleton generation framework developed as proprietary software by PrimeSense, to generate a similar 3D human skeleton model. Markers are not necessary for structured-light sensors. They are also inexpensive and can provide 3D skeleton information in real time. On the other hand, since structured-light cameras are based on infrared light, they can only work in an indoor environment. The frame rate (30 Hz) and resolution of depth images (320×240) are also relatively low.

Fig. 2. Examples of skeletal human body models obtained from different devices. The OpenNI library tracks 15 joints; the Kinect v1 SDK tracks 20 joints; the Kinect v2 SDK tracks 25; and MoCap systems can track various numbers of joints.

2.1.3 Time-of-Flight (ToF) Sensors

ToF sensors are able to acquire accurate depth data at a high frame rate, by emitting light and measuring the amount of time it takes for that light to return (the distance to a surface follows as d = c·Δt/2, where Δt is the measured round-trip time and c is the speed of light), similar in principle to established depth sensing technologies such as radar and LiDAR. Compared to other ToF sensors, the Microsoft Kinect v2 camera offers an affordable alternative to acquire depth data using this technology. In addition, a color camera is integrated into the sensor to provide registered color data. The color-depth data can be accessed via the Kinect SDK 2.0 [79] or the OpenKinect library (using the libfreenect2 driver) [76]. The Kinect v2 camera provides a higher resolution of depth images (512×424) at 30 Hz. Moreover, the camera is able to provide 3D skeleton data by estimating positions of 25 human joints, with better tracking accuracy than the Kinect v1 sensor. Similar to the first version, the Kinect v2 has a working range of approximately 0.5 to 5 meters.

2.2 3D Joint Estimation and Skeleton Construction

Besides manual human skeletal joint annotation [56], [80], [81], a number of approaches have been designed to automatically construct a skeleton model from perception data. Some of these are based on methods used in RGB imagery, while others take advantage of the extra information available in a depth or RGB-D image. The majority of the current methods are based on body part recognition, and then fit a flexible model to the now 'known' body part locations. An alternate main methodology is starting with a 'known' prior, and fitting the silhouette or point cloud to this prior after the humans are localized [31], [82], [83]. This section provides a brief review of autonomous skeleton construction methods based on visual data, organized according to the information that is used. A summary of the reviewed skeleton construction techniques is presented in Table 1.

2.2.1 Construction from Depth Imagery

Due to the additional 3D geometric information that depth imagery can provide, many methods have been developed to build a 3D human skeleton model based on a single depth image or a sequence of depth frames.

Human joint estimation via body part recognition is one popular approach to construct the skeleton model [20], [51], [53], [54], [55], [56], [62], [64]. A seminal paper by Shotton et al. [20] in 2011 provided an extremely effective skeleton construction algorithm based on body part recognition that was able to work in real time. A single depth image (independent of previous frames) is classified on a per-pixel basis, using a randomized decision forest classifier. Each branch in the forest is determined by a simple relation between the target pixel and various others. The pixels that are classified into the same category form the body part, and the joint is inferred by the mean-shift method from a certain body part, using the depth data to 'push' them into the silhouette. While training the decision forests takes a large number of images (around 1 million) as well as a considerable amount of computing power, the fact that the branches in the forest are very simple allows this algorithm to generate 3D human skeleton models within about 5 ms. An extended work was published in [51], with both accuracy and speed improved. Plagemann et al. [62] introduced an approach to recognize body parts using Haar features [84] and construct a skeleton model on these parts. Using data over time, they construct a Bayesian network, which produces the estimated pose using body part locations and starts with the previous pose as a prior [25]. Holt et al. [56] proposed Connected Poselets to estimate 3D human pose from depth data. The approach utilizes the idea of poselets [85], which is widely applied for pose estimation from RGB images. For each depth image, a multi-scale sliding window is applied, and a decision forest is used to detect poselets and estimate joint locations. Using a skeleton prior inspired by pictorial structures [86], [87], the method begins with a torso point and connects outwards to body parts. By applying kinematic inference to eliminate impossible poses, they are able to reject incorrect body part classifications and improve their accuracy.

Another widely investigated methodology to construct 3D human skeleton models from depth imagery is based on nearest-neighbor matching [52], [57], [58], [59], [60], [63]. Several approaches for whole-skeleton matching are based on the Iterative Closest Point (ICP) method [88], which can iteratively decide a rigid transformation such that the input query points fit to the points in the given model under this transformation. Using point clouds of a person with known poses as a model, several approaches [57], [60] apply ICP to fit the unknown poses by estimating the translation and rotation to fit the unknown body parts to the known model. While these approaches are relatively accurate, they suffer from several drawbacks. ICP is computationally expensive for a model with as many degrees of freedom as a human body. Additionally, it can be difficult to recover from tracking loss. Typically the previous pose is used as the known pose to fit to; if tracking loss occurs and this pose becomes inaccurate, then further fitting can be difficult or impossible. Finally, skeleton construction methods based on the ICP algorithm generally require an initial T-pose to start the iterative process.

2.2.2 Construction from RGB Imagery

Early approaches and several recent methods based on deep learning focused on 2D or 3D human skeleton construction from traditional RGB or intensity images, typically by identifying human body parts using visual features (e.g., image gradients, deeply learned features, etc.), or by matching known poses to a segmented silhouette.

Methods based on a single image: Many algorithms were proposed to construct a human skeletal model using a single color or intensity image acquired from a monocular camera [65], [68], [69], [89]. Wang et al. [65] construct a 3D human skeleton from a single image using a linear combination of known skeletons with physical constraints on limb lengths. Using a 2D pose estimator [89], the algorithm begins with a known 2D pose and a mean 3D pose, and calculates camera parameters from this estimation. The 3D joint positions are recalculated using the estimated parameters, and the camera parameters are updated. The steps continue iteratively until convergence. This approach was demonstrated to be robust to partial occlusions and errors in the 2D estimation. Dong et al. [68] considered the human parsing and pose estimation problems simultaneously. The authors introduced a unified framework based on semantic parts using a tailored And-Or graph, and also employed parselets and Mixture of Joint-Group Templates as the representation.

Recently, deep neural networks have proven their ability in human skeleton construction [66], [67], [70]. Toshev and Szegedy [67] employed Deep Neural Networks (DNNs) for human pose estimation.
The cascade of DNN regressors proposed in [67] obtains pose estimation results with high precision. Fan et al. [66] use Dual-Source Deep Convolutional Neural Networks (DS-CNNs) for estimating 2D human poses from a single image. This method takes a set of image patches as the input and learns the appearance of each local body part by considering its views in the full body, which successfully addresses the joint recognition and localization issue. Tompson et al. [70] proposed a unified learning framework based on deep Convolutional Networks (ConvNets) and Markov Random Fields, which can generate a heat-map to encode a per-pixel likelihood for human joint localization from a single RGB image.

Methods based on multiple images: When multiple images of a human are acquired from different perspectives by a multi-camera system, traditional stereo vision techniques can be employed to estimate depth maps of the human. After obtaining the depth image, a human skeleton model can be constructed using methods based on depth information (Section 2.2.1). Although there exists a commercial solution that uses marker-less multi-camera systems to obtain highly precise skeleton data at 120 frames per second (FPS) with approximately 25-50 ms latency [90], computing depth maps is usually slow and often suffers from problems such as failures of correspondence search and noisy depth information. To address these problems, algorithms were also studied to construct human skeleton models directly from the multiple images without calculating the depth image [71], [72], [73]. For example, Gall et al. [72] introduced an approach to fully automatically estimate the 3D skeleton model from a multi-perspective video sequence, where an articulated template model and silhouettes are obtained from the sequence. Another method was proposed by Liu et al. [73], which uses a modified global optimization method to handle occlusions.

2.3 Benchmark Datasets With Skeletal Data

In the past five years, a large number of benchmark datasets containing 3D human skeleton data were collected in different scenarios and made available to the public. This section provides a complete review of the datasets, as listed in Table 2. We categorize and discuss these datasets according to the type of devices used to acquire the skeleton information.

2.3.1 Datasets Collected Using MoCap Systems

Early 3D human skeleton datasets were usually collected by a MoCap system, which can provide accurate locations of a various number of skeleton joints by tracking the markers attached to the human body, typically in indoor environments. The CMU MoCap dataset [91] is one of the earliest resources, consisting of a wide variety of human actions, including interaction between two subjects, human locomotion, interaction with uneven terrain, sports, and other human actions. The recent Human3.6M dataset [92] is one of the largest MoCap datasets, consisting of 3.6 million human poses and corresponding images captured by a high-speed MoCap system. It contains activities by 11 professional actors in 17 scenarios (discussion, smoking, taking a photo, talking on the phone, etc.), and provides accurate 3D joint positions and high-resolution videos. The PosePrior dataset [69] is the newest MoCap dataset and includes an extensive variety of human stretching poses performed by trained athletes and gymnasts. Many other MoCap datasets were also released, including the Pictorial Human Spaces [93], CMU Multi-Modal Activity (CMU-MMAC) [94], Berkeley MHAD [95], Stanford ToFMCD [25], HumanEva-I [96], and HDM05 MoCap [97] datasets.

2.3.2 Datasets Collected by Structured-Light Cameras

Affordable structured-light cameras are widely used for 3D human skeleton data acquisition. Numerous datasets were collected using the Kinect v1 camera in different scenarios. The MSR Action3D dataset [121], [126] was captured using the Kinect camera at Microsoft Research, and consists of subjects performing American Sign Language gestures and a variety of typical human actions, such as making a phone call or reading a book. The dataset provides RGB, depth, and skeleton information generated by the Kinect v1 camera for each data instance. A large number of approaches used this dataset for evaluation and validation [127]. The MSRC-12 Kinect gesture dataset [120], [128] is one of the largest gesture databases available. Consisting of nearly seven hours of data and over 700,000 frames of a variety of subjects performing different gestures, it provides the pose estimation and other data that was recorded with a Kinect v1 camera. The Cornell Activity Dataset (CAD) includes CAD-60 [117] and CAD-120 [110], which contain 60 and 120 RGB-D videos of human daily activities, respectively. The dataset was recorded by a Kinect v1 in different environments, such as an office, bedroom, kitchen, etc. The SBU-Kinect-Interaction dataset [123] contains skeleton data of a pair of subjects performing different interaction activities, with one person acting and the other reacting. Many other datasets captured using a Kinect v1 camera were also released to the public, including the MSR Daily Activity 3D [121], MSR Action Pairs [113], Online RGBD Action (ORGBD) [107], UTKinect-Action [124], Florence 3D-Action [118], CMU-MAD [104], UTD-MHAD [103], G3D/G3Di [105], [119], SPHERE [108], ChaLearn [111], RGB-D Person Re-identification [122], Northwestern-UCLA Multiview Action 3D [106], Multiview 3D Event [114], CDC4CV pose [56], SBU-Kinect-Interaction [123], UCF-Kinect [115], SYSU 3D Human-Object Interaction [100], Multi-View TJU [99], M2I [98], and 3D Iconic Gesture [116] datasets. The complete list of human-skeleton datasets collected using structured-light cameras is presented in Table 2.

2.3.3 Datasets Collected by Other Techniques

Besides the datasets collected by MoCap systems or structured-light cameras, additional technologies were also applied to collect datasets containing 3D human skeleton information, such as multiple-camera systems, ToF cameras such as the Kinect v2, or even manual annotation.

Due to the low price and improved performance of the Kinect v2 camera, it has become increasingly widely adopted to collect 3D skeleton data. The Telecommunication Systems Team (TST) created a collection of datasets using Kinect v2 ToF cameras, which includes three datasets for different purposes. The TST fall detection dataset [109] contains eleven different subjects performing falling activities and activities of daily living in a variety of scenarios; the TST TUG dataset [102] contains twenty different individuals standing up and walking around; and the TST intake monitoring dataset contains food intake actions performed by 35 subjects [101].

Manual annotation approaches are also widely used to provide skeleton data. The KTH Multiview Football dataset [112] contains images of professional football players during real matches, which are obtained using color sensors from 3 views, with 14 annotated joints for each frame. Several other skeleton datasets are collected based on manual annotation, including the LSP dataset [81] and the TUM Kitchen dataset [125].

TABLE 2. Publicly Available Benchmark Datasets Providing 3D Human Skeleton Information.

Release Year | Dataset and Reference | Acquisition Device | Other Data Source | Scenario
2015 | M2I [98] | Kinect v1 | RGB + depth | human daily activities
2015 | Multi-View TJU [99] | Kinect v1 | RGB + depth | human daily activities
2015 | PosePrior [69] | MoCap | color | extreme motions
2015 | SYSU 3D HOI [100] | Kinect v1 | color + depth | human daily activities
2015 | TST Intake Monitoring [101] | Kinect v2 + IMU | depth | human daily activities
2015 | TST TUG [102] | Kinect v2 + IMU | depth | human daily activities
2015 | UTD-MHAD [103] | Kinect v1 + IMU | RGB + depth | atomic actions
2014 | CMU-MAD [104] | Kinect v1 | RGB + depth | atomic actions
2014 | G3Di [105] | Kinect v1 | RGB + depth | gaming
2014 | Human3.6M [92] | MoCap | color | movies
2014 | Northwestern-UCLA Multiview [106] | Kinect v1 | RGB + depth | human daily activities
2014 | ORGBD [107] | Kinect v1 | RGB + depth | human-object interactions
2014 | SPHERE [108] | Kinect | depth | human daily activities
2014 | TST Fall Detection [109] | Kinect v2 + IMU | depth | human daily activities
2013 | Berkeley MHAD [95] | MoCap | RGB + depth | human daily activities
2013 | CAD-120 [110] | Kinect v1 | RGB + depth | human daily activities
2013 | ChaLearn [111] | Kinect v1 | RGB + depth | Italian gestures
2013 | KTH Multiview Football [112] | 3 cameras | color | professional football activities
2013 | MSR Action Pairs [113] | Kinect v1 | RGB + depth | activities in pairs
2013 | Multiview 3D Event [114] | Kinect v1 | RGB + depth | indoor human activities
2013 | Pictorial Human Spaces [93] | MoCap | color | human daily activities
2013 | UCF-Kinect [115] | Kinect v1 | color | human daily activities
2012 | 3DIG [116] | Kinect v1 | color + depth | iconic gestures
2012 | CAD-60 [117] | Kinect v1 | RGB + depth | human daily activities
2012 | Florence 3D-Action [118] | Kinect v1 | color | human daily activities
2012 | G3D [119] | Kinect v1 | RGB + depth | gaming
2012 | MSRC-12 Gesture [120] | Kinect v1 | N/A | gaming
2012 | MSR Daily Activity 3D [121] | Kinect v1 | RGB + depth | human daily activities
2012 | RGB-D Person Re-identification [122] | Kinect v1 | RGB + 3D mesh | person re-identification
2012 | SBU-Kinect-Interaction [123] | Kinect v1 | RGB + depth | human interaction activities
2012 | UTKinect Action [124] | Kinect v1 | RGB + depth | atomic actions
2011 | CDC4CV pose [56] | Kinect v1 | depth | basic activities
2010 | HumanEva [96] | MoCap | color | human daily activities
2010 | MSR Action3D [121] | Kinect v1 | depth | gaming
2010 | Stanford ToFMCD [25] | MoCap + ToF sensor | depth | human daily activities
2009 | TUM Kitchen [125] | 4 cameras | color | manipulation activities
2008 | CMU-MMAC [94] | MoCap | color | cooking in kitchen
2007 | HDM05 MoCap [97] | MoCap | color | human daily activities
2001 | CMU MoCap [91] | MoCap | N/A | gaming + sports + movies

3 INFORMATION MODALITY

Skeleton-based human representations are constructed from various features computed from raw 3D skeletal data, where each feature source is called a modality. From the perspective of information modality, 3D skeleton-based human representations can be classified into four categories, based on joint displacement, orientation, raw position, and combined information. Existing approaches falling in each category are summarized in detail in Tables 3-6, respectively.

3.1 Displacement-Based Representations

Features extracted from displacements of skeletal joints are widely applied in many skeleton-based representations due to their simple structure and easy implementation. They use information from the displacement of skeletal joints, which can either be the displacement between different human joints within the same frame or the displacement of the same joint across different time periods.

3.1.1 Spatial Displacement Between Joints

Representations based on relative joint displacements compute spatial displacements of coordinates of human skeletal joints in 3D space, which are acquired from the same frame at a time point.

The pairwise relative position of human skeleton joints is the most widely studied displacement feature for human representation [121], [129], [130], [132], [136], [138]. Within the same skeleton model obtained at a time point, for each joint p = (x, y, z) in 3D space, the difference between the locations of joint i and joint j is calculated by p_ij = p_i − p_j, i ≠ j. The joint locations p are often normalized, so that the feature is invariant to the absolute body position, initial body orientation, and body size [121], [129], [130]. Chen and Koskela [132] implemented a similar feature extraction method based on the pairwise relative position of skeleton joints, with normalization calculated by ‖p_i − p_j‖ / Σ_{i≠j} ‖p_i − p_j‖, i ≠ j, which is illustrated in Fig. 3(a).

Another group of joint displacement features extracted from the same frame for skeleton-based representation construction is based on the difference to a reference joint. In these features, the displacements are obtained by calculating the coordinate difference of all joints with respect to a reference joint, usually manually selected. Given the location of a joint (x, y, z) and a given reference joint (x_c, y_c, z_c) in the world coordinate system, Rahmani et al. [133] defined the spatial joint displacement as (∆x, ∆y, ∆z) = (x − x_c, y − y_c, z − z_c).

TABLE 3. Summary of 3D Skeleton-Based Representations Based on Joint Displacement Features.
Notation. Feature encoding: Concatenation-based (Conc), Statistics-based (Stat), Bag-of-Words (BoW). Structure and transition: Low-level features (Low-lv), Body part models (Body), Manifolds (Manif). Feature engineering: Hand-crafted features (Hand), Dictionary learning (Dict), Unsupervised feature learning (Unsup), Deep learning (Deep). In the remaining columns, 'T' indicates that temporal information is used in feature extraction; 'VI' stands for View-Invariant; 'ScI' for Scale-Invariant; 'SpI' for Speed-Invariant; 'OL' for OnLine; 'RT' for Real-Time.

Reference | Approach | Feature Encoding | Structure and Transition | Feature Engineering | T / VI / ScI / SpI / OL / RT
Hu et al. [100] | JOULE | BoW | Low-lv | Unsup | ✓ ✓ ✓
Wang et al. [106] | Cross View | BoW | Body | Dict | ✓ ✓
Wei et al. [114] | 4D Interaction | Conc | Low-lv | Hand | ✓ ✓ ✓
Ellis et al. [115] | Latency Trade-off | Conc | Low-lv | Hand | ✓ ✓ ✓ ✓
Wang et al. [121], [129] | Actionlet | Conc | Low-lv | Hand | ✓ ✓ ✓
Barbosa et al. [122] | Soft-biometrics Feature | Conc | Body | Hand |
Xia et al. [124] | Hist. of 3D Joints | Stat | Low-lv | Hand | ✓ ✓
Yun et al. [123] | Joint-to-Plane Distance | Conc | Low-lv | Hand | ✓ ✓ ✓
Yang and Tian [130], [131] | EigenJoints | Conc | Low-lv | Unsup | ✓ ✓ ✓ ✓ ✓
Chen and Koskela [132] | Pairwise Joints | Conc | Low-lv | Hand | ✓ ✓ ✓
Rahmani et al. [133] | Joint Movement Volumes | Stat | Low-lv | Hand | ✓
Luo et al. [134] | Sparse Coding | BoW | Low-lv | Dict | ✓ ✓
Jiang et al. [135] | Hierarchical Skeleton | BoW | Low-lv | Hand | ✓ ✓ ✓ ✓
Yao and Li [136] | 2.5D Graph Representation | BoW | Low-lv | Hand | ✓ ✓
Vantigodi and Babu [137] | Variance of Joints | Stat | Low-lv | Hand | ✓ ✓
Zhao et al. [138] | Motion Templates | BoW | Low-lv | Dict | ✓ ✓ ✓ ✓
Yao et al. [139] | Coupled Recognition | Conc | Low-lv | Hand | ✓
Zhang et al. [140] | Star Skeleton | BoW | Low-lv | Hand | ✓ ✓ ✓ ✓ ✓
Zou et al. [141] | Key Segment Mining | BoW | Low-lv | Dict | ✓ ✓ ✓
Kakadiaris and Metaxas [142] | Physics Based Model | Conc | Low-lv | Hand | ✓
Nie et al. [143] | ST Parts | BoW | Body | Dict | ✓ ✓
Anirudh et al. [144] | TVSRF Space | Conc | Manif | Hand | ✓ ✓ ✓ ✓
Koppula and Saxena [145] | Temporal Relational Features | Conc | Low-lv | Hand | ✓
Wu and Shao [146] | EigenJoints | Conc | Low-lv | Deep | ✓ ✓ ✓ ✓
Kerola et al. [147] | Spectral Graph Skeletons | Conc | Low-lv | Hand | ✓ ✓ ✓
Joints,whichcombinesthreecategoriesoffeaturesincluding (x,y,z)−(xc,yc,zc), where the reference joint can be the staticposture,motion,andoffsetfeatures.Inparticular,the skeleton centroid or a manually selected, fixed joint. For joint displacement of the current frame with respect to the each sequence of human skeletons representing an activity, previous frame and initial frame is calculated. Ellis et al. the computed displacements along each dimension (e.g., [115] introduced an algorithm to reduce latency for action ∆x, ∆y or ∆z) are used as features to represent humans. recognition using a 3D skeleton-based representation that Luo et al. [134] applied similar position information for depends on spatio-temporal features computed from the feature extraction. Since the joint hip center has relatively information in three frames: the current frame, the frame small motions for most actions, they used that joint as the collected 10 time steps ago, and the frame collected 30 reference.Luetal.[124]introducedHistogramsof3DJoint framesago.Then,thefeaturesarecomputedasthetemporal Locations (HOJ3D) features by assigning 3D joint positions displacementamongthosethreeframes.Anotherapproach into cone bins in 3D space. Twelve key joints are selected to construct temporal displacement representations incor- and their displacements are computed with respect to the porates the object being interacted with in each pose [114]. centertorsopoint.Usinglineardiscriminantanalysis(LDA), This approach constructs a hierarchical graph to represent the features are reprojected to extract the dominant ones. positions in 3D space and motion through 1D time. The Since the spherical coordinate system used in [124] is ori- differencesofjointcoordinatesintwosuccessiveframesare entedwiththexaxisalignedwiththedirectionapersonis defined as the features. Hu et al. [100] introduced the joint facing,theirapproachisviewinvariant. 
heterogeneous features learning (JOULE) model through extracting the pose dynamics using skeleton data from a 3.1.2 TemporalJointDisplacement sequence of depth images. A real-time skeleton tracker is usedtoextractthetrajectoriesofhumanjoints.Thenrelative 3Dhumanrepresentationsbasedontemporaljointdisplace- positionsofeachtrajectorypairisusedtoconstructfeatures ments compute the location difference across a sequence todistinguishdifferenthumanactions. of frames acquired at different time points. Usually, they employ both spatial and temporal information to represent Thejointmovementvolumeisanotherfeatureconstruc- peopleinspaceandtime. tionapproachforhumanrepresentationthatalsousesjoint A widely used temporal displacement feature is imple- displacement information for feature extraction, especially mentedbycomparingthejointcoordinatesatdifferenttime when a joint exhibits a large movement [133]. For a given steps.YangandTian[130],[131]introducedanovelfeature joint, extreme positions during the full joint motion are based on the position difference of joints, called Eigen- computed along x, y, and z axes. The maximum moving 8 TABLE4 Summaryof3DSkeleton-BasedRepresentationsBasedonJointOrientationFeatures.NotationIsPresentedinTable3. 
Feature Structure Feature Reference Approach T VI ScI SpI OL RT Encoding andTransition Engineering Sungetal.[117][148] OrientationMatrix Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Fothergilletal.[128] JointAngles Conc Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) Guetal.[149] GestureRecognition BoW Lowlv Dict (cid:88) (cid:88) (cid:88) JinandChoi[150] PairwiseOrientation Stat Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) ZhangandTian[151] PairwiseFeatures Stat Lowlv Hand (cid:88) (cid:88) (cid:88) Kapsourasand DynemesRepresentation BoW Lowlv Dict (cid:88) Nikolaidis[152] Vantigodiand Meta-cognitiveRBF Stat Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) Radhakrishnan[153] Ohn-BarandTrivedi[154] HOG2 Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Chaudhryetal.[155] ShapefromNeuroscience BoW Body Dict (cid:88) (cid:88) Oflietal.[156] SMIJ Conc Lowlv Unsup (cid:88) (cid:88) (cid:88) Mirandaetal.[157] JointAngle BoW Lowlv Dict (cid:88) (cid:88) (cid:88) (cid:88) FuandSantello[158] HandKinematics Conc Lowlv Hand (cid:88) (cid:88) Zhouetal.[159] 4Dquaternions BoW Lowlv Dict (cid:88) (cid:88) (cid:88) (cid:88) CampbellandBobick[160] PhaseSpace Conc Lowlv Hand (cid:88) (cid:88) (cid:88) BoubouandSuzuki[161] HOVV Stat Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) Sharafetal.[162] Jointanglesandvelocities Stat Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) Salakhutdinovetal.[163] HDModels Conc Lowlv Deep (cid:88) (cid:88) (cid:88) (cid:88) Parameswaran ISTs Conc Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) andChellappa[164] the person’s torso. A similar approach was also introduced in[148]basedontheorientationmatrix. Another approach is to calculate the orientation of two joints, called relative joint orientations. 
Jin and Choi [150] utilized vector orientations from one joint to another joint, named the first order orientation vector, to construct 3D (a) Displacementofpair- (b) Relative joint displacement and joint human representations. The approach also proposed a sec- wisejoints[132] motionvolumefeatures[133] ond order neighborhood that connects adjacent vectors. The authors used a uniform quantization method to con- Fig.3. Examplesof3Dhumanrepresentationsbasedonjointdisplace- ments. vertthecontinuousorientationsintoeightdiscretesymbols to guarantee robustness to noise. Zhang and Tian [151] used a two mode 3D skeleton representation, combining rangeofeachjointalongeachdimensionisthencomputed structural data with motion data. The structural data is byLa =max(aj)−min(aj),wherea=x,y,z;andthejoint represented by pairwise features, relating the positions of volumeisdefinedasVj =LxLyLz,asdemonstratedinFig. each pair of joints relative to each other. The orientation 3(b). For each joint, Lx,Ly,Lz and Vj are flattened into a betweentwojointsiandj wasalsoused,whichisgivenby featurevector.Theapproachalsoincorporatesrelativejoint (cid:16) (cid:17) θ(i,j) = arcsin ix−jx /2π, where dist(i,j) denotes the displacementswithrespecttothetorsojointintothefeature. dist(i,j) geometrydistancebetweentwojointsiandj in3Dspace. 3.2 Orientation-BasedRepresentations 3.2.2 TemporalJointOrientation Anotherwidelyusedinformationmodalityforhumanrep- resentationconstructionisbasedonjointorientations,since Humanrepresentationsbasedontemporaljointorientations ingeneralorientation-basedfeaturesareinvarianttohuman usually compute the difference between orientations of the position,bodysize,andorientationtothecamera. 
same jointacross a temporalsequence of frames.Campbell and Bobick [160] introduced a mapping from the Cartesian 3.2.1 SpatialOrientationofPairwiseJoints spacetothe“phasespace”.Bymodelingthejointtrajectory Approachesbasedonspatialorientationsofpairwisejoints in the new space, the approach is able to represent a curve computetheorientationofdisplacementvectorsofapairof that can be easily visualized and quantifiably compared to humanskeletaljointsacquiredatthesametimestep. other motion curves. Boubou and Suzuki [161] described Apopularorientation-basedhumanrepresentationcom- a representation based on the so-called Histogram of Ori- putestheorientationofeachjointtothehumancentroidin entedVelocityVectors(HOVV),whichisahistogramofthe 3Dspace.Forexample,Guetal.[149]collectedtheskeleton velocity orientations computed from 19 human joints in a data with fifteen joints and extracted features representing skeletonkinematicmodelacquiredfromtheKinectv1cam- joint angles with respect to the person’s torso. Sung et al. era. Each temporal displacement vector is described by its [117] computed the orientation matrix of each human joint orientationin3Dspaceasthejointmovesfromtheprevious with respect to the camera, and then transformed the joint position to the current location. By using a static skeleton rotationmatrixtoobtainthejointorientationwithrespectto prior to deal with static poses with little or no movement, 9 TABLE5 SummaryofRepresentationsBasedonRawPositionInformation.NotationIsPresentedinTable3. 
Feature Structure Feature Reference Approach T VI ScI SpI OL RT Encoding andTransition Engineering Duetal.[22] BRNNs Conc Body Deep (cid:88) Kazemietal.[112] JointPositions Conc Lowlv Hand (cid:88) Seidenarietal.[118] Multi-PartBagofPoses BoW Lowlv Dict (cid:88) (cid:88) (cid:88) Chaaraouietal.[165] EvolutionaryJointSelection BoW Lowlv Dict (cid:88) Reyesetal.[166] VectorofJoints Conc Lowlv Hand (cid:88) (cid:88) Patsaduetal.[167] VectorofJoints Conc Lowlv Hand (cid:88) (cid:88) HuangandKitani[168] CostTopology Stat Lowlv Hand Devanneetal.[169] MotionUnits Conc Manif Hand (cid:88) Wangetal.[170] MotionPoselets BoW Body Dict (cid:88) Weietal.[171] StructuralPrediction Conc Lowlv Hand (cid:88) (cid:88) Guptaetal.[172] 3DPosew/oBodyParts Conc Lowlv Hand (cid:88) (cid:88) Amoretal.[173] Skeleton’sShape Conc Manif Hand (cid:88) (cid:88) (cid:88) Sheikhetal.[174] ActionSpace Conc Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) YilmaandShah[175] MultiviewGeometry Conc Lowlv Hand (cid:88) (cid:88) Gongetal.[176] StructuredTime Conc Manif Hand (cid:88) (cid:88) (cid:88) RahmaniandMian[177] KnowledgeTransfer BoW Lowlv Dict (cid:88) Munselletal.[178] MotionBiometrics Stat Lowlv Hand (cid:88) (cid:88) Lilloetal.[179] ComposableActivities BoW Lowlv Dict (cid:88) (cid:88) (cid:88) Wuetal.[180] Watch-n-Patch BoW Lowlv Dict (cid:88) (cid:88) (cid:88) (cid:88) GongandMedioni[181] DynamicManifolds BoW Manif Dict (cid:88) (cid:88) (cid:88) Hanetal.[182] HierarchicalManifolds BoW Manif Dict (cid:88) (cid:88) (cid:88) (cid:88) Slamaetal.[183],[184] GrassmannManifolds BoW Manif Dict (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) Devanneetal.[185] RiemannianManifolds Conc Manif Hand (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) Huangetal.[186] ShapeTracking Conc Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) Devanneetal.[187] RiemannianManifolds Conc Manif Hand (cid:88) (cid:88) (cid:88) (cid:88) Zhuetal.[188] RNNwithLSTM Conc Lowlv Deep (cid:88) Chenetal.[189] 
EnwMiLearning BoW Lowlv Dict (cid:88) (cid:88) (cid:88) Husseinetal.[190] Covarianceof3DJoints Stat Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) Shahroudyetal.[191] FourierTemporalPyramid BoW Body Unsup (cid:88) (cid:88) (cid:88) JungandHong[192] ElementaryMovingPose BoW Lowlv Dict (cid:88) (cid:88) (cid:88) (cid:88) Evangelidisetal.[193] SkeletalQuad Conc Lowlv Hand (cid:88) (cid:88) (cid:88) AzaryandSavakis[194] GrassmannManifolds Conc Manif Hand (cid:88) (cid:88) (cid:88) (cid:88) Barnachonetal.[195] Hist.ofActionPoses Stat Lowlv Hand (cid:88) (cid:88) Shahroudyetal.[196] FeatureFusion BoW Body Unsup (cid:88) (cid:88) ZhouandDelaTorre[197] Spatio-temporalMatching BoW Lowlv Hand (cid:88) (cid:88) (cid:88) thismethodisabletoeffectivelyrepresenthumanswithstill skeleton frames, a matrix can be formed to naively encode posesin3Dspaceinhumanactionrecognitionapplications. the sequence with each column containing the flattened jointcoordinatesobtainedataspecifictimepoint.Following this direction, Hussein et al. [190] computed the statistical 3.3 RepresentationsBasedonRawJointPositions Covariance of 3DJoints (Cov3DJ) as their features, asillus- Besidesjointdisplacementsandorientations,rawjointposi- tratedinFig.4.Specifically,givenK humanjointswitheach tionsdirectlyobtainedfromsensorsarealsousedbymany joint denoted by pi = (xi,yi,zi),i = 1,...,K, a feature methodstoconstructspace-time3Dhumanrepresentations. vector is formed to encode the skeleton acquired at time t: S(t) = [x(t),...,x(t),y(t),...,y(t),z(t),...,z(t)](cid:62). Given a 1 K 1 K 1 K temporalsequenceofT skeletonframes,theCov3DJfeature iscomputedbyC(S)= 1 (cid:80)T (S(t)−S¯(t))(S(t)−S¯(t))(cid:62), T−1 t=1 where S¯ is the mean of all S. Since not all the joints are equally informative, several methods were proposed to selectkeyjointsthataremoredescriptive[165],[166],[167], [168]. Chaaraoui et al. [165] introduced an evolutionary al- gorithmtoselectasubsetofskeletonjointstoformfeatures. 
Then a normalizing process was used to achieve position, scale and rotation invariance. Similarly, Reyes et al. [166] selected 14 joints in 3D human skeleton models without normalization for feature extraction in gesture recognition applications. Anothergroupofrepresentationconstructiontechniques Fig.4.3DhumanrepresentationbasedontheCov3DJdescriptor[190]. utilizetherawjointpositioninformationtoformatrajectory, Acategoryofapproachesflattenjointpositionsacquired andthenextractfeaturesfromthistrajectory,whichareoften inthesameframeintoacolumnvector.Givenasequenceof calledthetrajectory-basedrepresentation.Forexample,Wei 10 TABLE6 SummaryofRepresentationsBasedonMulti-ModalInformation.NotationIsPresentedinTable3. Feature Structure Feature Reference Approach T VI ScI SpI OL RT Encoding &Transition Engineering Ganapathietal.[25] KinematicChain Conc Lowlv Hand (cid:88) AkhterandBlack[69] JointPositionwithLimits Conc Body Hand (cid:88) Ionescuetal.[92] MPJPE&MPJAE Conc Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) Marinoiuetal.[93] VisualFixationPattern Conc Lowlv Hand Sigaletal.[96] ParametrizationoftheSkeleton Conc Lowlv Hand (cid:88) Huangetal.[104] SMMED Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Bloometal.[105] PoseBasedFeatures Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Yuetal.[107] Orderlets Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Paiementetal.[108] NormalizedJoints Conc Manif Hand (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) KoppulaandSaxena[110] NodeFeatureMap Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Sadeghipouretal.[116] SpatialPositions&Directions Conc Lowlv Hand (cid:88) Bloometal.[119] DynamicFeatures Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Tenorthetal.[125] SetofNominalFeatures Conc Lowlv Hand Zanfiretal.[198] MovingPose BoW Lowlv Dict (cid:88) (cid:88) (cid:88) Lehrmannetal.[199] VectorofJoints Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Bloometal.[200] DynamicFeatures Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Vemulapallietal.[201] LieGroupManifold Co 
Manif Hand (cid:88) (cid:88) (cid:88) (cid:88) ZhangandParker[202] BIPOD Stat Body Hand (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) LvandNevatia[203] HMM/Adaboost Conc Lowlv Hand (cid:88) (cid:88) (cid:88) Pons-Molletal.[204] Posebits Conc Lowlv Hand Herdaetal.[205] Quaternions Conc Body Hand (cid:88) (cid:88) (cid:88) (cid:88) Neginetal.[206] RDFKinematicFeatures Conc Lowlv Unsup (cid:88) (cid:88) (cid:88) Masoodetal.[207] PairwiseJointDisplacement Conc Lowlv Hand (cid:88) (cid:88) (cid:88) &TemporalLocationVariations Gowayyedetal.[208] HOD Stat Lowlv Hand (cid:88) (cid:88) (cid:88) (cid:88) Meshryetal.[209] Angle&MovingPose BoW Lowlv Unsup (cid:88) (cid:88) (cid:88) (cid:88) TaoandVidal[210] MovingPoselets BoW Body Dict (cid:88) Eweiwietal.[211] DiscriminativeActionFeatures Conc Lowlv Unsup (cid:88) (cid:88) (cid:88) Guerra-Filho Visuo-motorPrimitives Conc Lowlv Hand (cid:88) (cid:88) (cid:88) andAloimonos[212] et al. [171] used a sequence of 3D human skeletal joints to theskeleton-basedrepresentationconstruction,inwhichthe construct joint trajectories, and applied wavelets to encode rawpositionsofhumanjointsaredirectlyusedastheinput eachtemporaljointsequenceintofeatures,whichisdemon- totheRNN.Zhuetal.[188]usedraw3Djointcoordinatesas strated in Fig. 5. Gupta et al. [172] proposed a cross-view theinputtoaRNNwithLongShort-TermMemory(LSTM) humanrepresentation,whichmatchestrajectoryfeaturesof toautomaticallylearnhumanrepresentations. videos to MoCap joint trajectories and uses these matches to generate multiple motion projections as features. Junejo 3.4 Multi-ModalRepresentations et al. [213] used trajectory-based self-similarity matrices (SSMs) to encode humans observed from different views. Since multiple information modalities are available, an in- Thismethodshowedgreatcross-viewstabilitytorepresent tuitive way to improve the descriptive power of a human humansin3DspaceusingMoCapdata. 
representation is to integrate multiple information sources and build a multi-modal representation to encode humans in3Dspace.Forexample,thespatialjointdisplacementand orientation can be integrated together to build human rep- resentations. Guerra-Filho and Aloimonos [212] proposed a method that maps 3D skeletal joints to 2D points in the projectionplaneofthecameraandcomputesjointdisplace- ments and orientations of the 2D joints in the projected plane.Gowayyedetal.[208]developedthehistogramofori- ented displacements (HOD) representation that computes the orientation of temporal joint displacement vectors and usestheirmagnitudeastheweighttoupdatethehistogram Fig.5.Trajectory-basedrepresentationbasedonwaveletfeatures[171]. inordertomaketherepresentationspeed-invariant. Multi-modal space-time human representations were Similartotheapplicationofdeeplearningtechniquesto alsoactivelystudied,whichareabletointegratebothspatial extractfeaturesfromimageswhererawpixelsaretypically andtemporalinformationandrepresenthumanmotionsin used as input, skeleton-based human representations built 3Dspace.Yuetal.[107]integratedthreetypesoffeaturesto by deep learning methods generally rely on raw joint po- construct a spatio-temporal representation, including pair- sition information. For example, Du et al. [22] proposed an wisejointdistances,spatialjointcoordinates,andtemporal end-to-endhierarchicalrecurrentneuralnetwork(RNN)for variations of joint locations. Masood et al. [207] imple-
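The pairwise relative-position feature of Sec. 3.1.1 (p_ij = p_i - p_j, with the magnitude normalization of Chen and Koskela [132]) can be sketched in plain Python as follows; function names and the toy skeleton are illustrative, not taken from any cited implementation.

```python
import math

def pairwise_displacements(joints):
    """Pairwise relative positions p_ij = p_i - p_j for all i != j
    within one skeleton frame (joints: list of (x, y, z) tuples)."""
    feats = []
    for i, pi in enumerate(joints):
        for j, pj in enumerate(joints):
            if i != j:
                feats.append(tuple(a - b for a, b in zip(pi, pj)))
    return feats

def normalized_pairwise_magnitudes(joints):
    """Normalization in the spirit of [132]:
    ||p_i - p_j|| / sum_{i != j} ||p_i - p_j||,
    which makes the feature invariant to body size."""
    norms = [math.dist(pi, pj)
             for i, pi in enumerate(joints)
             for j, pj in enumerate(joints) if i != j]
    total = sum(norms)
    return [n / total for n in norms]
```

Because each magnitude is divided by the sum over all pairs, uniformly scaling the skeleton (a larger body) leaves the normalized feature unchanged.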
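The reference-joint displacement of Sec. 3.1.1, (Δx, Δy, Δz) = (x, y, z) - (x_c, y_c, z_c) [133], with a fixed reference such as the hip center [134], reduces to a one-liner; the function name and default index are illustrative assumptions.

```python
def reference_displacements(joints, ref_index=0):
    """Displacement (dx, dy, dz) of every joint with respect to a
    reference joint, e.g., the hip center, which moves little for
    most actions [134]. joints: list of (x, y, z) tuples."""
    xc, yc, zc = joints[ref_index]
    return [(x - xc, y - yc, z - zc) for (x, y, z) in joints]
```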
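The three EigenJoints feature categories of Sec. 3.1.2 (static posture, motion, and offset [130], [131]) can be sketched as below. This is a simplified illustration: the published method additionally normalizes the features and applies PCA, which is omitted here, and the handling of the first frame (zero motion) is our assumption.

```python
def eigenjoints_sketch(frames):
    """For each frame t, compute three joint-displacement sets in the
    spirit of EigenJoints [130], [131]:
      - posture: pairwise displacements within frame t
      - motion:  displacement of each joint w.r.t. the previous frame
      - offset:  displacement of each joint w.r.t. the initial frame
    frames: list of skeletons, each a list of (x, y, z) joints."""
    def diff(a, b):
        return [tuple(p - q for p, q in zip(ja, jb)) for ja, jb in zip(a, b)]
    feats = []
    for t, cur in enumerate(frames):
        posture = [tuple(p - q for p, q in zip(cur[i], cur[j]))
                   for i in range(len(cur))
                   for j in range(len(cur)) if i != j]
        # assumption: zero motion for the first frame
        motion = diff(cur, frames[t - 1]) if t > 0 else diff(cur, cur)
        offset = diff(cur, frames[0])
        feats.append((posture, motion, offset))
    return feats
```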
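The joint movement volume of Sec. 3.1.2 (L_a = max(a_j) - min(a_j) per axis and V_j = L_x L_y L_z [133]) is straightforward to compute per joint; the function name is illustrative.

```python
def joint_movement_volume(trajectory):
    """Moving range and volume of one joint over a full sequence [133]:
    L_a = max(a) - min(a) for each axis a in {x, y, z},
    and V = L_x * L_y * L_z.
    trajectory: list of (x, y, z) positions of a single joint."""
    xs, ys, zs = zip(*trajectory)
    lx = max(xs) - min(xs)
    ly = max(ys) - min(ys)
    lz = max(zs) - min(zs)
    return lx, ly, lz, lx * ly * lz
```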
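The pairwise orientation of Sec. 3.2.1, θ(i, j) = arcsin((i_x - j_x) / dist(i, j)) / 2π [151], maps directly to code; the function name is an illustrative choice.

```python
import math

def pairwise_orientation(pi, pj):
    """Orientation between joints i and j as defined in [151]:
    theta(i, j) = arcsin((i_x - j_x) / dist(i, j)) / (2 * pi).
    pi, pj: (x, y, z) tuples; dist is the 3D Euclidean distance."""
    return math.asin((pi[0] - pj[0]) / math.dist(pi, pj)) / (2 * math.pi)
```

For two joints separated by a unit step along x, the ratio is 1, so θ = arcsin(1) / 2π = 0.25.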
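The Cov3DJ descriptor of Sec. 3.3 (flatten each frame to S(t) and take the sample covariance over T frames [190]) can be sketched without any matrix library; names are illustrative, and the hierarchical temporal levels of the full method are omitted.

```python
def flatten(frame):
    """S(t) = [x_1..x_K, y_1..y_K, z_1..z_K] for one skeleton frame."""
    return ([j[0] for j in frame] + [j[1] for j in frame]
            + [j[2] for j in frame])

def cov3dj(frames):
    """Covariance of 3D Joints [190]:
    C = 1/(T-1) * sum_t (S(t) - S_mean)(S(t) - S_mean)^T,
    returned as a 3K x 3K nested list."""
    S = [flatten(f) for f in frames]
    T, D = len(S), len(S[0])
    mean = [sum(s[d] for s in S) / T for d in range(D)]
    return [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in S) / (T - 1)
             for b in range(D)] for a in range(D)]
```

The descriptor size depends only on the number of joints K, not on the sequence length T, which is what makes it usable for sequences of varying duration.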
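The HOD idea of Sec. 3.4 (each temporal displacement votes into an angle histogram, weighted by its magnitude [208]) can be sketched for a single 2D projection; the full method uses the xy, xz, and yz planes and a temporal pyramid, which are omitted here, and all names are illustrative.

```python
import math

def hod_2d(trajectory, nbins=8):
    """Histogram of Oriented Displacements [208] for one joint's 2D
    trajectory. Each displacement votes into an angle bin, weighted by
    its magnitude; magnitude weighting makes the (normalized) histogram
    insensitive to how fast the same path is traversed."""
    hist = [0.0] * nbins
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        mag = math.hypot(dx, dy)
        if mag == 0:
            continue  # skip still poses
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[min(int(angle / (2 * math.pi) * nbins), nbins - 1)] += mag
    total = sum(hist) or 1.0
    return [h / total for h in hist]
```

Traversing the same straight path with one large step or two small steps accumulates the same total weight, illustrating the speed invariance the text describes.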
