Pose Invariant Embedding for Deep Person Re-identification

Liang Zheng†, Yujia Huang‡, Huchuan Lu§, Yi Yang†
†University of Technology Sydney ‡CMU §Dalian University of Technology
{liangzheng06,yee.i.yang}@gmail.com [email protected] [email protected]

arXiv:1701.07732v1 [cs.CV] 26 Jan 2017

Abstract

Pedestrian misalignment, which mainly arises from detector errors and pose variations, is a critical problem for a robust person re-identification (re-ID) system. With bad alignment, the background noise will significantly compromise the feature learning and matching process. To address this problem, this paper introduces the pose invariant embedding (PIE) as a pedestrian descriptor. First, in order to align pedestrians to a standard pose, the PoseBox structure is introduced, which is generated through pose estimation followed by affine transformations. Second, to reduce the impact of pose estimation errors and information loss during PoseBox construction, we design a PoseBox fusion (PBF) CNN architecture that takes the original image, the PoseBox, and the pose estimation confidence as input. The proposed PIE descriptor is thus defined as the fully connected layer of the PBF network for the retrieval task. Experiments are conducted on the Market-1501, CUHK03, and VIPeR datasets. We show that PoseBox alone yields decent re-ID accuracy, and that when integrated in the PBF network, the learned PIE descriptor produces competitive performance compared with the state-of-the-art approaches.

Figure 1. Examples of misalignment correction by PoseBox. Row 1: original bounding boxes with detection errors/occlusions. Every two consecutive boxes correspond to the same person. Row 2: corresponding PoseBoxes. We observe that misalignment can be corrected to some extent.

1. Introduction

This paper studies the task of person re-identification (re-ID). Given a probe (person of interest) and a gallery, we aim to find in the gallery all the images containing the same person as the probe. We focus on the identification problem, a retrieval task in which each probe has at least one ground truth in the gallery [42]. A number of factors affect the re-ID accuracy, such as detection/tracking errors and variations in illumination, pose, viewpoint, etc.

A critical influencing factor on re-ID accuracy is the misalignment of pedestrians, which can be attributed to two causes. First, pedestrians naturally take on various poses as shown in Fig. 1. Pose variations imply that the position of the body parts within the bounding box is not predictable. For example, it is possible that one's hands reach above the head, or that one is riding a bicycle instead of being upright. The second cause of misalignment is detection error. As illustrated in the second row of Fig. 1, detection errors may lead to severe vertical misalignment.

When pedestrians are poorly aligned, the re-ID accuracy can be compromised. For example, a common practice in re-ID is to partition the bounding box into horizontal stripes [20, 42, 1, 21]. This method works under the assumption of slight vertical misalignment. But when severe vertical misalignment does happen, as in the cases in Row 2 of Fig. 1, one's head will be matched to the background of a misaligned image. So horizontal stripes may be less effective when severe misalignment happens. In another example, under various pedestrian poses, the background may be incorrectly weighted by the feature extractors and thus affect the subsequent matching accuracy.

To our knowledge, two previous works [8, 7] from the same group explicitly consider the misalignment problem. In both works, the pictorial structure (PS) is used, which shares a similar motivation and construction process with PoseBox, and the retrieval process mainly relies on matching the normalized body parts. While the idea of constructing normalized poses is similar, our work locates body joints using a state-of-the-art CNN based pose estimator, and the components of PoseBox are different from PS as evidenced by large-scale evaluations. Another difference of our work is the matching procedure.
While [8, 7] do not discuss the pose estimation errors which prevalently exist in real-world datasets, we show that these errors make rigid feature learning/matching with only the PoseBox yield inferior results to the original image, and that the three-stream PoseBox fusion network effectively alleviates this problem.

Figure 2. Information loss and pose estimation errors that occur during PoseBox construction. Row 1: important pedestrian details (highlighted in red bounding boxes) may be missing in the PoseBox. Row 2: pose estimation errors deteriorate the quality of PoseBoxes. For each image pair, the original image and its PoseBox are on the left and right, respectively.

Considering the above-mentioned problems and the limits of previous methods, this paper proposes the pose invariant embedding (PIE) as a robust visual descriptor. Two steps are involved. First, we construct a PoseBox for each pedestrian bounding box. PoseBox depicts a pedestrian with a standardized upright stance. Carefully designed with the help of pose estimators [34], PoseBox aims to produce well-aligned pedestrian images so that the learned feature can find the same person under intensive pose changes. Trained alone using a standard CNN architecture [37, 41, 44], we show that PoseBox yields very decent re-ID accuracy.

Second, to reduce the impact of information loss and pose estimation errors (Fig. 2) during PoseBox construction, we build a PoseBox fusion (PBF) CNN model with three streams as input: the PoseBox, the original image, and the pose estimation confidence. PBF achieves a globally optimized tradeoff between the original image and the PoseBox. PIE is thus defined as the FC activations of the PBF network. On several benchmark datasets, we show that the joint training procedure yields re-ID accuracy competitive with the state of the art. To summarize, this paper has three contributions.

• Minor contribution: the PoseBox is proposed, which shares a similar nature with a previous work [8]. It enables well-aligned pedestrian matching, and yields satisfying re-ID performance when used alone.
• Major contribution: the pose invariant embedding (PIE) is proposed as a part of the PoseBox Fusion (PBF) network. PBF fuses the original image, the PoseBox and the pose estimation confidence, thus providing a fallback mechanism when pose estimation fails.
• Using PIE, we report competitive re-ID accuracy on the Market-1501, CUHK03, and VIPeR datasets.

2. Related Work

Pose estimation. Pose estimation research has shifted from traditional approaches [8, 7] to deep learning following the pioneering work "DeepPose" [30]. Some recent methods employ multi-scale features and study mechanisms on how to combine them [29, 26]. It is also effective to inject spatial relationships between body joints by regularizing the unary scores and pairwise comparisons [11, 27]. This paper adopts the convolutional pose machines (CPM) [34], a state-of-the-art pose estimator with multiple stages and successive pose predictions.

Deep learning for re-ID. Due to their superior performance, deep learning based methods have been dominating the re-ID community in the past two years. In the two earlier works [20, 39], the siamese model, which takes two images as input, is used. In later works, this model is improved in various ways, such as injecting more sophisticated spatial constraints [1, 6], modeling the sequential properties of body parts using LSTM [32], and mining discriminative matching parts for different image pairs [31]. It is pointed out in [43] that the siamese model only uses weak re-ID labels, i.e., whether two images are of the same person or not, and it is suggested that an identification model which fully uses the strong re-ID labels is superior. Several previous works adopt the identification model [37, 36, 41]. In [41], the video frames are used as training samples of each person class, and in [37], effective neurons are discovered for each training domain and a new dropout strategy is proposed. The architecture proposed in [36] is more similar to the PBF model in our work. In [36], hand-crafted low-level features are concatenated after a fully connected (FC) layer which is connected to the softmax layer. Our network is similar to [36] in that confidence scores of pose estimation are concatenated with the other two FC layers. It departs from [36] in that our network takes three streams as input, two of which are raw images.

Poses for re-ID. Although pose changes have been mentioned by many previous works as an influencing factor on re-ID, only a handful of reports can be found discussing the connection between them. Farenzena et al. [12] propose to detect the symmetrical axis of different body parts and extract features following the pose variation. In [35], rough estimates of the upper-body orientation are provided by the HOG detector, and the upper body is then rendered into the texture of an articulated 3D model. Bak et al. [3] further classify each person into three pose types: front, back, and side. A similar idea is exploited in [9], where four pose types are used. Both works [3, 9] apply view-point specific distance metrics according to different testing pose pairs. The closest works to PoseBox are [8, 7], which construct the pictorial structure (PS), a similar concept to PoseBox. They use traditional pose estimators and hand-crafted descriptors that are inferior to CNN by a large margin. Our work employs a full set of stronger techniques, and designs a more effective CNN structure, as evidenced by the competitive re-ID accuracy on large-scale datasets.

Figure 3. PoseBox construction. Given an input image, the pedestrian pose is estimated by CPM [34]. Ten body parts can then be discovered through the body joints. Three types of PoseBoxes are built from the body parts. PoseBox 1: torso + legs; PoseBox 2: PoseBox 1 + arms; PoseBox 3: PoseBox 2 + head.

3. Proposed Method

3.1. PoseBox Construction

The construction of PoseBox has two steps, i.e., pose estimation and PoseBox projection.

Pose estimation. This paper adopts the off-the-shelf model of the convolutional pose machines (CPM) [34]. In a nutshell, CPM is a sequential convolutional architecture that enforces intermediate supervision to prevent vanishing gradients. A set of 14 body joints are detected, i.e., head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles, as shown in the second column of Fig. 3.

Body part discovery and affine projection. From the detected joints, 10 body parts can be depicted (the third column of Fig. 3). The parts include head, torso, upper and lower arms (left and right), and upper and lower legs (left and right), which almost cover the whole body. These quadrilateral parts are projected to rectangles using affine transformations.

In more detail, the head is defined with the joints of head and neck, and we manually set the width of each head box to 2/3 of its height (from head to neck). An upper arm is confined by the shoulder and elbow joints, and the lower arm by the elbow and wrist joints. The width of the arm boxes is set to 20 pixels. Similarly, the upper and lower legs are defined by the hip and knee joints, and the knee and ankle joints, respectively. Their widths are both 30 pixels. The torso is confined by four body joints, i.e., the two shoulders and the two hips, so we simply draw a quadrangle for the torso. Due to pose estimation errors, the affine transformation may encounter singular values. So in practice, we add some small random disturbance when the pose estimation confidence of a body part is below a threshold (set to 0.4).

Three types of PoseBoxes. In several previous works discussing the performance of different parts, a common observation is that the torso and legs make the largest contributions [8, 1, 6]. This is expected because the most distinguishing features exist in the upper-body and lower-body clothes. Based on the existing observations, this paper builds three types of PoseBoxes as described below.

• PoseBox 1. It consists of the torso and two legs. A leg comprises the upper and the lower legs. PoseBox 1 includes the two most important body parts and is a baseline for the other two PoseBox types.
• PoseBox 2. Based on PoseBox 1, we further add the left and right arms. An arm includes the upper and lower arm sub-modules. In our experiment we show that PoseBox 2 is superior to PoseBox 1 due to the enriched information brought by the arms.
• PoseBox 3. On the basis of PoseBox 2, we put the head box on top of the torso box. It is shown in [8] that the inclusion of the head brought a marginal performance increase. In our case, we find that PoseBox 3 is slightly inferior to PoseBox 2, probably because of the frequent head/neck estimation errors.

Remarks. The advantage of PoseBox is two-fold. First, the pose variations can be corrected. Second, background noise can be largely removed.

PoseBox is also limited in two aspects. First, pose estimation errors often happen, leading to imprecisely detected joints. Second, PoseBox is designed manually, so it is not guaranteed to be optimal in terms of information loss or re-ID accuracy. We address the two problems by a fusion scheme to be introduced in Section 3.3. For the second problem, specifically, we note that we construct PoseBoxes manually because current re-ID datasets do not provide ground truth poses, without which it is not trivial to design an end-to-end learning method to automatically generate normalized poses.

3.2. Baselines

Figure 4. The baseline identification CNN model used in this paper. The AlexNet [19] or ResNet-50 [15] with softmax loss is used. The FC activations are extracted for Euclidean-distance testing.

This paper constructs two baselines based on the original pedestrian image and PoseBox, respectively. According to the results in the recent survey [43], the identification model [19] outperforms the verification model [1, 20] significantly on the Market-1501 dataset [42]: the former makes full use of the re-ID labels, i.e., the identity of each bounding box, while the latter only uses weak labels, i.e., whether two boxes belong to the same person.

[Figure 5, pose estimation confidence scores for images (a) to (d), one value per body joint]
(a) .05 .01 .06 .13 .07 .03 .04 .03 .16 .38 .51 .12 .55 .60
(b) .03 .01 .05 .05 .07 .04 .06 .08 .04 .01 .02 .06 .02 .01
(c) .75 .79 .64 .57 .45 .74 .43 .50 .49 .76 .52 .56 .72 .13
(d) .15 .34 .25 .71 .62 .58 .72 .55 .09 .04 .01 .11 .01 .01
Figure 5. Examples of pose estimation errors and the confidence scores. Upper: four pedestrian bounding boxes named (a), (b), (c), and (d), and their pose estimation results. Lower: pose estimation confidence scores of the four images. A confidence vector consists of 14 numbers corresponding to the 14 body joints. We highlight the correctly detected joints in green.

So in this paper we adopt the identification CNN model (Fig. 4). Specifically, this paper uses the standard AlexNet [19] and ResNet-50 [15] architectures. We refer readers to the respective papers for detailed network descriptions.

During training, we employ the default parameter settings, except that the last FC layer is edited to have the same number of neurons as the number of distinct IDs in the training set. During testing, given an input image resized to 224×224, we extract the FC7/FC8 activations for AlexNet, and the Pool5/FC activations for ResNet-50. After ℓ2 normalization, we use Euclidean distance to perform person retrieval in the testing set. With respect to the input image type, two baselines are used in this paper:

• Baseline 1: the original image (resized to 224×224) is used as input to CNN during training and testing.
• Baseline 2: the PoseBox (resized to 224×224) is used as input to CNN during training and testing. Note that only one PoseBox type is used each time.

3.3. The PoseBox Fusion (PBF) Network

Motivation. During PoseBox construction, pose estimation errors and information loss may happen, leading to compromised quality of the PoseBox (see Fig. 2). On the one hand, pose estimation errors often happen, as we use an off-the-shelf pose estimator (which is usually the case under practical usage). As illustrated in Fig. 5 and Fig. 1, pose estimation may fail when the detections have missing parts or the pedestrian images are of low resolution. On the other hand, when cropping human parts from a bounding box, it is inevitable that important details are missed out, such as bags and umbrellas (Fig. 2). The failure in the construction of high-quality PoseBoxes and the information loss during part cropping may result in compromised results of baseline 2. This is confirmed in the experiment, where baseline 1 yields superior re-ID accuracy to baseline 2.

For the first problem, i.e., the pose estimation errors, we can mostly foretell the quality of pose estimation by resorting to the confidence scores (examples can be seen in Fig. 5). Under high estimation confidence, we envision fine quality of the generated PoseBox. But when the pose estimation confidence scores are low for some body parts, it may be expected that the constructed PoseBox has poor quality. For the second problem, the missing visual cues can be rescued by re-introducing the original image, so that the discriminative details are captured by the deep network.

Network. Given the above considerations, this paper proposes a three-stream PoseBox Fusion (PBF) network which takes the original image, the PoseBox, and the confidence vector as input (see Fig. 6). To leverage the ImageNet [10] pre-trained models, the two types of image inputs, i.e., the original image and the PoseBox, are resized to 256×256 (then cropped randomly to 227×227) for AlexNet [19] and 224×224 for ResNet-50 [15]. The third input, i.e., the pose estimation confidence scores, is a 14-dim vector, in which each entry falls within the range [0, 1].

The two image inputs are fed to two CNNs of the same structure. Due to the content differences of the original image and its PoseBox, the two streams of convolutional layers do not share weights, although they are initialized from the same seed model. The FC6 and FC7 layers are connected to these convolutional layers. For the confidence vector, we add a small FC layer which projects the 14-dim vector to a 14-dim FC vector. We concatenate the three inputs at the FC7 layer, which is further fully connected to FC8. The sum of the three Softmax losses is used for loss computation. When the ResNet-50 [15] is used instead of AlexNet, Fig. 6 does not have the FC6 layers, and the FC7 and FC8 layers are known as Pool5 and FC.

In Fig. 6, as denoted by the green bounding box, the pose invariant embedding (PIE) can either be the concatenated FC7 activations (4,096+4,096+14 = 8,206-dim) or its next fully connected layer (751-dim and 1,160-dim for Market-1501 and CUHK03, respectively). For AlexNet, we denote the two PIE descriptors as PIE(A, FC7) and PIE(A, FC8), respectively; for ResNet-50, they are termed PIE(R, Pool5) and PIE(R, FC), respectively.

Figure 6. Illustration of the PoseBox Fusion (PBF) network using AlexNet. The network inputs, i.e., the original image, its PoseBox, and the pose estimation confidence, are highlighted in bold. The former two undergo convolutional layers before the fully connected (FC) layers. The confidence vector undergoes one FC layer, before the three FC7 layers are concatenated and fully connected to FC8. SoftMax loss is used. Two alternatives of the PIE descriptor are highlighted by green boxes. For AlexNet and Market-1501, PIE(A, FC7) is 8,206-dim, and PIE(A, FC8) is 751-dim; for ResNet-50, there would be no FC6, and PIE(R, Pool5) is 4,110-dim, and PIE(R, FC) is 751-dim.

During training, batches of the input triplets (the original image, its PoseBox, and the confidence vector) are fed into PBF, and the sum of the three losses is back-propagated to the convolutional layers. The ImageNet pretrained model initializes both the original image and PoseBox streams.

During testing, given the three inputs of an image, we extract PIE as the descriptor. Note that we apply ReLU on the extracted embeddings, which produces superior results according to our preliminary experiment. Then the Euclidean distance is used to calculate the similarity between the probe and gallery images, before a sorted rank list is produced.

PBF has three advantages. First, the confidence vector is an indicator of whether the PoseBox is reliable. This improves the learning ability of PBF as a static embedding network, so that a global tradeoff between the PoseBox and the original image can be found. Second, the original image not only enables a fallback mechanism when pose estimation fails, but also retains the pedestrian details that may be lost during PoseBox construction but are useful in discriminating identities. Third, the PoseBox provides important complementary cues to the original image. Using the correctly predicted joints, pedestrian matching can be more accurate with the well-aligned images. The influence of detection errors and pose variations can thus be reduced.

4. Experiment

4.1. Dataset

This paper uses three datasets for evaluation, i.e., VIPeR [14], CUHK03 [20], and Market-1501 [42]. The VIPeR dataset contains 632 identities, each having 2 images captured by 2 cameras. It is evenly divided into training and testing sets, each consisting of 316 IDs and 632 images. We perform 10 random train/test splits and calculate the averaged accuracy. The CUHK03 dataset contains 1,360 identities and 13,164 images. Each person is observed by 2 cameras, and on average there are 4.8 images under each camera. We adopt the single-shot mode and evaluate this dataset under 20 random train/test splits. The Market-1501 dataset features 1,501 IDs, 19,732 gallery images and 12,936 training images captured by 6 cameras. Both CUHK03 and Market-1501 are produced by the DPM detector [13]. The Cumulative Matching Characteristics (CMC) curve is used for all three datasets, which encodes the probability that the query person is found within the top n ranks in the rank list. For Market-1501 and CUHK03, we additionally employ the mean Average Precision (mAP), which considers both the precision and recall of the retrieval process [42]. The evaluation toolbox provided by the Market-1501 authors is used.

4.2. Experimental Setups

Our experiments directly employ the off-the-shelf convolutional pose machines (CPM), a multi-stage CNN model trained on the MPII human pose dataset [2]. Default settings are used with input images resized to 384 × 192. For the PBF network, we replace the convolutional layers with those from either the AlexNet [19] or ResNet-50 [15]. When AlexNet is used, n1 = 4,096, n2 = 14, n3 = 751. When ResNet-50 is used, PBF will not have the FC6 layer, and the FC7 layer is denoted by Pool5: n1 = 2,048, n3 = 751. We train the PBF network for 36 epochs. The initial learning rate is set to 0.01, and is reduced by 10x every 6 epochs. We run the deep learning experiments using a GTX 1080 under the Caffe framework [16], and the batch size is set to 32 and 16 using AlexNet and ResNet-50, respectively. For both CNN models, it takes 6-7 hours for the training process to converge on the Market-1501 dataset.

We train PIE on Market-1501 and CUHK03, respectively, which have relatively large data volumes. We also test the generalization ability of PIE on some smaller datasets such as VIPeR. That is, we only extract features using the model pre-trained on Market-1501, and then learn some distance metric on the small datasets.

Table 1. Comparison of the proposed method with various baselines. PoseBox 2 is employed here. Baseline 1: training using the original image. Baseline 2: training using the PoseBox. PIE: proposed pose invariant embedding. A: AlexNet. R: ResNet-50.
Methods | dim | Market-1501: rank-1 | rank-5 | rank-10 | rank-20 | mAP | CUHK03: rank-1 | rank-5 | rank-10 | rank-20 | Market-1501→VIPeR: rank-1 | rank-5 | rank-10 | rank-20
Baseline1 (A, FC7) | 4,096 | 55.49 | 76.28 | 83.55 | 88.98 | 32.36 | 57.15 | 83.50 | 90.85 | 95.70 | 17.44 | 31.84 | 41.04 | 51.36
Baseline1 (A, FC8) | 751 | 53.65 | 75.48 | 82.93 | 88.51 | 31.60 | 58.80 | 85.80 | 91.90 | 96.25 | 17.15 | 32.06 | 41.68 | 51.55
Baseline1 (R, Pool5) | 2,048 | 73.02 | 87.44 | 91.24 | 94.70 | 47.62 | 51.60 | 79.60 | 87.70 | 95.00 | 23.42 | 42.31 | 51.96 | 63.80
Baseline1 (R, FC) | 751 | 70.58 | 84.95 | 90.02 | 93.53 | 45.84 | 54.80 | 84.20 | 91.70 | 97.60 | 15.85 | 28.80 | 37.41 | 47.85
Baseline2 (A, FC7) | 4,096 | 52.22 | 71.53 | 78.95 | 85.04 | 28.95 | 39.90 | 71.40 | 82.30 | 90.00 | 17.28 | 32.59 | 42.25 | 55.09
Baseline2 (A, FC8) | 751 | 51.10 | 72.24 | 79.48 | 85.60 | 29.91 | 42.30 | 75.05 | 84.35 | 92.00 | 16.04 | 33.45 | 42.66 | 54.97
Baseline2 (R, Pool5) | 2,048 | 64.49 | 79.48 | 85.07 | 88.95 | 38.16 | 36.90 | 68.40 | 78.70 | 86.70 | 21.11 | 37.18 | 45.89 | 54.34
Baseline2 (R, FC) | 751 | 62.20 | 78.36 | 83.76 | 88.84 | 37.91 | 41.70 | 72.70 | 84.20 | 92.50 | 15.57 | 26.68 | 33.54 | 41.71
PIE (A, FC7) | 8,206 | 64.61 | 82.07 | 87.83 | 91.75 | 38.95 | 59.80 | 85.35 | 91.85 | 95.85 | 21.77 | 38.04 | 46.61 | 56.61
PIE (A, FC8) | 751 | 65.68 | 82.51 | 87.89 | 91.63 | 41.12 | 62.40 | 88.00 | 93.70 | 96.50 | 18.10 | 31.20 | 38.92 | 49.40
PIE (R, Pool5) | 4,108 | 78.65 | 90.26 | 93.59 | 95.69 | 53.87 | 57.10 | 84.60 | 91.40 | 96.20 | 27.44 | 43.01 | 50.82 | 60.22
PIE (R, FC) | 751 | 75.12 | 88.27 | 92.28 | 94.77 | 51.57 | 61.50 | 89.30 | 94.50 | 97.60 | 23.80 | 37.88 | 47.31 | 56.55

[Figure 7 plot: CMC curves, matching rate (%) versus rank (1 to 10). Curves: PIE(Pool5)+kissme, B1+B2+kissme, PIE(Pool5)+EU, PIE(FC)+FC(img)+FC(pb)+EU, PIE(Pool5,img)+EU, PIE(Pool5,pb)+EU, B1+EU, B2+EU.]
Figure 7. Comparison with various feature combinations on the Market-1501 dataset. ResNet-50 [15] is used. Kissme [18] is used for distance metric learning. "EU": Euclidean distance. "PIE(Pool5,img)" and "PIE(Pool5,pb)" denote the 2,048-dim sub-vectors of the full 4,108-dim PIE(Pool5) vector, corresponding to the image and PoseBox streams of PBF, respectively. "FC(img)" and "FC(pb)" are the 751-dim FC vectors of the image and PoseBox streams of PBF, respectively. "B1" and "B2" represent baseline 1 and 2, respectively, using the 2,048-dim Pool5 features.

4.3. Evaluation

Baselines. We first evaluate the two re-ID baselines described in Section 3.2. The results on three datasets are shown in Table 1. Two major conclusions can be drawn.

First, we observe that very competitive performance can be achieved by baseline 1, i.e., training with the original image. Specifically, on Market-1501, we achieve rank-1 accuracy of 55.49% and 73.02% using AlexNet and ResNet-50, respectively. These numbers are consistent with those reported in [43].
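The rank-k accuracies quoted here follow from the test-time procedure of Sections 3.2 and 3.3 (embedding post-processing, Euclidean distance, sorted rank list) and the CMC definition of Section 4.1. The NumPy sketch below illustrates that pipeline under stated assumptions and is not the Market-1501 evaluation toolbox: the function names `embed` and `cmc_curve` are hypothetical, ReLU and ℓ2 normalization are applied together although the paper describes them for PIE and for the baselines respectively, and dataset-specific rules such as same-camera or junk-image filtering are omitted.

```python
import numpy as np


def embed(activations):
    """Post-process FC activations: ReLU, then l2-normalize each row.

    Assumption: both steps are combined here for illustration only.
    """
    x = np.maximum(np.asarray(activations, dtype=np.float64), 0.0)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)  # guard against all-zero rows


def cmc_curve(probe_feats, probe_ids, gallery_feats, gallery_ids, max_rank=20):
    """Return cmc[k-1] = fraction of probes whose first correct match
    appears within the top k gallery entries (simplified CMC).

    Assumes every probe identity occurs at least once in the gallery.
    """
    p = embed(probe_feats)
    g = embed(gallery_feats)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)

    # Squared Euclidean distances between every probe and gallery vector.
    dist = (p ** 2).sum(1)[:, None] + (g ** 2).sum(1)[None, :] - 2.0 * p @ g.T

    # Rank list: gallery indices sorted by increasing distance, per probe.
    order = np.argsort(dist, axis=1)
    matches = gallery_ids[order] == probe_ids[:, None]

    # Position (0-based) of the first correct match in each rank list.
    first_hit = matches.argmax(axis=1)
    return np.array([(first_hit < k).mean() for k in range(1, max_rank + 1)])


if __name__ == "__main__":
    # Toy data: 2 probes, 3 gallery images, 2-dim "embeddings".
    probes = [[1.0, 0.0], [0.0, 1.0]]
    gallery = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.5]]
    curve = cmc_curve(probes, [0, 1], gallery, [0, 1, 0], max_rank=3)
    print("rank-1 / rank-2 / rank-3:", curve)
```

In the toy example the second probe is first matched to a gallery image of a different identity, so it only counts as a hit from rank 2 onwards; the mAP reported alongside CMC would additionally require the positions of all correct matches, which this sketch does not compute.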
Moreover, we find that FC7 (Pool5) is superior to FC8 (FC) on Market-1501 but the situation reverses on CUHK03. We speculate the CNN model is trained to be more specific to the Market-1501 training set due to its larger data volume, so retrieval on Market-1501 is more of a transfer task than CUHK03. This is also observed in transferring ImageNet models to other recognition tasks [28].

Second, compared with baseline 1, we can see that baseline 2 is to some extent inferior. On the Market-1501 dataset, for example, results obtained by baseline 2 are 3.3% and 8.9% lower using AlexNet and ResNet-50, respectively. The performance drop is expected due to the pose estimation errors and information loss mentioned in Section 3.3. Since this paper only employs the off-the-shelf pose estimator, we speculate that in the future the PoseBox baseline can be improved by re-training pose estimation using newly labeled data on the re-ID datasets.

The effectiveness of PIE. We test PIE on the re-ID benchmarks, and present the results in Table 1 and Fig. 7. Comparing with baseline 1 and baseline 2, we observe clearly that PIE yields higher re-ID accuracy. On Market-1501, for example, when using AlexNet and the FC7 descriptor, our method exceeds the two baselines by +5.5% and +8.8% in rank-1 accuracy, respectively. With ResNet-50, the improvement becomes slightly smaller, but still arrives at +5.0% and +6.8%, respectively. Specifically, rank-1 accuracy and mAP on Market-1501 arrive at 78.65% and 53.87%, respectively. On CUHK03 and VIPeR, consistent improvement over the baselines can also be observed.

Moreover, Figure 7 shows that Kissme [18] marginally improves the accuracy, indicating that the PIE descriptor is well-learned. The concatenation of the Pool5 features of baseline 1 and 2 coupled with Kissme produces lower accuracy compared with "PIE(Pool5)+kissme", illustrating that the PBF network learns more effective embeddings than learning separately. We also find that the 2,048-dim "PIE(Pool5,img)+EU" and "PIE(Pool5,pb)+EU" outperform the corresponding baseline 1 and 2. This suggests that PBF improves the baseline performance probably through the backpropagation of the fused loss.

[Figure 8 plot: CMC curves, matching rate (%) versus rank (1 to 10). Curves: PIE - PoseBox2, PIE - PoseBox3, PIE - PoseBox1, baseline2 - PoseBox2, baseline2 - PoseBox3, baseline2 - PoseBox1.]
Figure 8. Re-ID accuracy of the three types of PoseBoxes. Results of both the baseline and PIE are presented on the Market-1501 dataset.

[Figure 9 plot: CMC curves, matching rate (%) versus rank (1 to 10). Curves: PIE full model, -conf. vector, -two losses, -PoseBox, -img, baseline1, baseline2.]
Figure 9. Ablation studies on Market-1501. From the "full" model, we remove one component at a time. The removed components include PoseBox, original image, the confidence vector, and the two losses of the PoseBox and the original image.

Comparison of the three types of PoseBoxes. In Section 3.1, three types of PoseBoxes are defined. Their comparison results on Market-1501 are shown in Fig. 8. Our observation is two-fold.

First, PoseBox2 is superior to PoseBox1. On the Market-1501 dataset, PoseBox2 improves the rank-1 accuracy by xx% over PoseBox1. The inclusion of arms therefore increases the discriminative ability of the system. Since the upper arm typically shares the same color/texture with the torso, we speculate that it is the long/short sleeves that enhance the descriptors. Second, PoseBox2 has better performance than PoseBox3 as well. For PoseBox3, the integration of the head introduces more noise due to the unstable head detection, which deteriorates the overall system performance. Nevertheless, we find in Fig. 8 that the gap between different PoseBoxes decreases after being integrated in PBF. This is because the combination with the original image reduces the impact of estimation errors and the information loss, a contribution mentioned in Section 1.

Ablation experiment. To evaluate the effectiveness of different components of PBF, ablation experiments are conducted on the Market-1501 dataset. We remove one component from the full system at a time, including the PoseBox, the original image, the confidence vector, and the two losses of the PoseBox and original image streams. The CMC curves are drawn in Fig. 9, from which three conclusions can be drawn.

First, when the confidence vector or the two losses are removed, the remaining system is inferior to the full model, but displays similar accuracy. The performance drop is approximately 1% in the rank-1 accuracy. It illustrates that these two components are important regularization terms. The confidence vector informs the system of the reliability of the PoseBox, thus facilitating the learning process. The two identification losses provide additional supervision to prevent the performance degradation of the two individual streams. Second, after the removal of the stream of the original image ("-img"), the performance drops significantly but still remains superior to baseline 2. Therefore, the original image stream is very important, as it reduces re-ID failures that likely result from pose estimation errors. Third, when the PoseBox stream is cut off ("-PoseBox"), the network is inferior to the full model, but is better than baseline 1. This validates the indispensability of PoseBox, and suggests that the confidence vector improves baseline 1.

Comparison with the state-of-the-art methods. On Market-1501, we compare PIE with the state-of-the-art methods in Table 2. It is clear that our method outperforms these latest results by a large margin. Specifically, we achieve rank-1 accuracy = 77.97%, mAP = 52.76% using the single query mode. To our knowledge, we have set a new state of the art on the Market-1501 dataset.

Table 2. Comparison with state of the art on Market-1501.
Methods | rank-1 | rank-5 | rank-10 | rank-20 | mAP
BoW+Kissme [42] | 44.42 | 63.90 | 72.18 | 78.95 | 20.76
WARCA [17] | 45.16 | 68.12 | 76 | 84 | -
Temp. Adapt. [24] | 47.92 | - | - | - | 22.31
SCSP [4] | 51.90 | - | - | - | 26.35
Null Space [40] | 55.43 | - | - | - | 29.87
LSTM Siamese [32] | 61.6 | - | - | - | 35.3
Gated Siamese [31] | 65.88 | - | - | - | 39.55
PIE (Alex) | 65.68 | 82.51 | 87.89 | 91.63 | 41.12
PIE (Res50) | 78.65 | 90.26 | 93.59 | 95.69 | 53.87
+ Kissme | 79.33 | 90.76 | 94.41 | 96.52 | 55.95

On CUHK03, comparisons are presented in Table 3. When metric learning is not used, our results are competitive in rank-1 accuracy with recent methods such as [31], but are superior in rank-5, 10, 20, and mAP. When Kissme [18] is employed, we report higher results: rank-1 = 67.10%, and mAP = 71.32%, which exceed the current state of the art. We note that in [17], very high results are reported on the hand-drawn subset but no results can be found on the detected set. We also note that metric learning yields smaller improvements on Market-1501 than CUHK03, because the PBF network is better trained on Market-1501 due to its richer annotations.

Table 3. Comparison with state of the art on CUHK03 (detected).
Methods | rank-1 | rank-5 | rank-10 | rank-20 | mAP
BoW+HS [42] | 24.30 | - | - | - | -
Improved CNN [1] | 44.96 | 76.01 | 83.47 | 93.15 | -
XQDA [21] | 46.25 | 78.90 | 88.55 | 94.25 | -
SI-CI [33] | 52.2 | 74.3 | 92.3 | - | -
Null Space [40] | 54.70 | 84.75 | 94.80 | 95.20 | -
LSTM Siamese [32] | 57.3 | 80.1 | 88.3 | - | 46.3
MLAPG [22] | 57.96 | 87.09 | 94.74 | 98.00 | -
Gated Siamese [31] | 61.8 | 80.9 | 88.3 | - | 51.25
PIE (Alex) | 62.60 | 87.05 | 92.50 | 96.30 | 67.91
PIE (Res50) | 61.50 | 89.30 | 94.50 | 97.60 | 67.21
+ Kissme | 67.10 | 92.20 | 96.60 | 98.10 | 71.32

On VIPeR, we extract features using the off-the-shelf PIE model trained on Market-1501, and the comparison is shown in Table 4. We first compare PIE (using Euclidean distance) with the latest unsupervised methods, e.g., the Gaussian of Gaussian (GoG) [25], the Bag-of-Words (BOW) [42] descriptors, etc. We use the available code provided by the authors. We observe that PIE exceeds the competing methods in the rank-1, 5, and 10 accuracies. Then, compared with supervised works without feature fusion, our method (coupled with Mirror Representation [5] and MFA [38]) has decent results. We further fuse the PIE descriptor with the pre-computed transferred deep descriptors [36] and the LOMO descriptor [21]. We employ the mirror representation [5] and the MFA distance metric coupled with the Chi Square kernel. The fused system achieves a new state of the art on the VIPeR dataset with rank-1 accuracy = 54.49%.

Table 4. Comparison with state of the art on VIPeR. The top 6 rows are unsupervised; the bottom 10 rows use supervision.
Methods | rank-1 | rank-5 | rank-10 | rank-20
GOG [25] | 21.14 | 40.34 | 53.29 | 67.21
Enhanced Deep [36] | 15.47 | 34.53 | 43.99 | 55.41
SDALF [23] | 19.87 | 38.89 | 49.37 | 65.73
gBiCov [23] | 17.01 | 33.67 | 46.84 | 58.72
BOW [42] | 21.74 | - | - | 60.85
PIE | 27.44 | 43.01 | 50.82 | 60.22
XQDA [21] | 40.00 | 67.40 | 80.51 | 91.08
MLAPG [22] | 40.73 | - | 82.34 | 92.37
WARCA [17] | 40.22 | 68.16 | 80.70 | 91.14
Null Space [40] | 42.28 | 71.46 | 82.94 | 92.06
SI-CI [33] | 35.8 | 67.4 | 83.5 | -
SCSP [4] | 53.54 | 82.59 | 91.49 | 96.65
Mirror [5] | 42.97 | 75.82 | 87.28 | 94.84
Enhanced [36] + Mirror [5] | 34.87 | 66.68 | 79.30 | 90.38
LSTM Siamese [32] | 42.4 | 68.7 | 79.4 | -
Gated Siamese [31] | 37.8 | 66.9 | 77.4 | -
PIE + Mirror [5] + MFA [38] | 43.29 | 69.40 | 80.41 | 89.94
Fusion + MFA | 54.49 | 84.43 | 92.18 | 96.87

[Figure 10 image: two queries, each followed by three rank lists labeled baseline 1 (image), baseline 2 (PoseBox), and PIE.]
Figure 10. Sample re-ID results on the Market-1501 dataset. For each query placed on the left, the three rows correspond to the rank lists of baseline 1, baseline 2, and PIE, respectively. Green bounding boxes denote correctly retrieved images.

Two groups of sample re-ID results are shown in Fig. 10. In the first query, for example, the cyan clothes in the background lead to the misjudgement of the foreground characteristics, so that some pedestrians with local green/blue colors incorrectly receive top ranks. Using PIE, the foreground can be effectively cropped, leading to more accurate pedestrian matching.

5. Conclusion

This paper explicitly addresses the pedestrian misalignment problem in person re-identification. We propose the pose invariant embedding (PIE) as a pedestrian descriptor. We first construct PoseBox with the 14 joints detected with the convolutional pose machine [34]. PoseBox helps correct the pose variations caused by camera views, person motions and detector errors and enables well-aligned pedestrian matching. PIE is then learned through the PoseBox fusion (PBF) network, in which the original image is fused with the PoseBox and the pose estimation confidence. PBF reduces the impact of pose estimation errors and detail loss during PoseBox construction. We show that PoseBox yields fair accuracy when used alone and that PIE produces competitive accuracy compared with the state of the art.

References

[1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.
[2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693. IEEE, 2014.
[3] S. Bak, F. Martins, and F. Bremond. Person re-identification by pose priors. In SPIE/IS&T Electronic Imaging, pages 93990H–93990H. International Society for Optics and Photonics, 2015.
[4] D. Chen, Z. Yuan, B. Chen, and N. Zheng. Similarity learning with spatial constraints for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1268–1277, 2016.
[5] Y.-C. Chen, W.-S. Zheng, and J. Lai. Mirror representation for modeling view-specific transform in person re-identification. In Proc. IJCAI, pages 3402–3408. Citeseer, 2015.
[6] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based cnn with im-
[14] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In Proc. IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), volume 3. Citeseer, 2007.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[17] C. Jose and F. Fleuret. Scalable metric learning via weighted approximate rank component analysis. In European Conference on Computer Vision, 2016.
[18] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2288–2295, 2012.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[20] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification.
InPro- provedtripletlossfunction.InProceedingsoftheIEEECon- ceedings of the IEEE Conference on Computer Vision and ferenceonComputerVisionandPatternRecognition,pages PatternRecognition,pages152–159,2014. 1,2,3,5 1335–1344,2016. 2,3 [21] S.Liao,Y.Hu,X.Zhu,andS.Z.Li. Personre-identification [7] D.S.ChengandM.Cristani. Personre-identificationbyar- by local maximal occurrence representation and metric ticulatedappearancematching. InPersonRe-Identification, learning. InProceedingsoftheIEEEConferenceonCom- pages139–160.Springer,2014. 1,2 puter Vision and Pattern Recognition, pages 2197–2206, [8] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and 2015. 1,8 V.Murino. Custompictorialstructuresforre-identification. [22] S.LiaoandS.Z.Li. Efficientpsdconstrainedasymmetric InBritishMachineVisionConference,2011. 1,2,3 metriclearningforpersonre-identification. InProceedings [9] Y.-J.ChoandK.-J.Yoon.Improvingpersonre-identification of the IEEE International Conference on Computer Vision, viapose-awaremulti-shotmatching. InProceedingsofthe pages3685–3693,2015. 8 IEEEConferenceonComputerVisionandPatternRecogni- tion,pages1354–1362,2016. 2 [23] B.Ma,Y.Su,andF.Jurie. Bicov: anovelimagerepresen- tation for person re-identification and face verification. In [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- BritishMachiveVisionConference,page11,2012. 8 Fei. Imagenet: A large-scale hierarchical image database. InProceedingsoftheIEEEConferenceonComputerVision [24] N. Martinel, A. Das, C. Micheloni, and A. K. Roy- andPatternRecognition,pages248–255,2009. 4 Chowdhury. Temporal model adaptation for person re- [11] X. Fan, K. Zheng, Y. Lin, and S. Wang. Combining lo- identification. InEuropeanConferenceonComputerVision, cal appearance and holistic view: Dual-source deep neural 2016. 8 networksforhumanposeestimation. InProceedingsofthe [25] T.Matsukawa,T.Okabe,E.Suzuki,andY.Sato. 
Hierarchi- IEEEConferenceonComputerVisionandPatternRecogni- calgaussiandescriptorforpersonre-identification. InPro- tion,pages1347–1355,2015. 2 ceedings of the IEEE Conference on Computer Vision and [12] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and PatternRecognition,pages1363–1372,2016. 8 M.Cristani.Personre-identificationbysymmetry-drivenac- [26] A. Newell, K. Yang, and J. Deng. Stacked hourglass net- cumulation of local features. In Computer Vision and Pat- worksforhumanposeestimation. InEuropeanConference ternRecognition(CVPR),2010IEEEConferenceon,pages onComputerVision,2016. 2 2360–2367.IEEE,2010. 2 [27] L.Pishchulin, E.Insafutdinov, S.Tang,B.Andres, M.An- [13] B.Fernando,E.Fromont,D.Muselet,andM.Sebban. Dis- driluka,P.Gehler,andB.Schiele.Deepcut:Jointsubsetpar- criminativefeaturefusionforimageclassification. InPro- titionandlabelingformultipersonposeestimation. InPro- ceedings of the IEEE Conference on Computer Vision and ceedings of the IEEE Conference on Computer Vision and PatternRecognition,pages3434–3441,2012. 5 PatternRecognition,2016. 2 9 [28] A.SharifRazavian, H.Azizpour, J.Sullivan, andS.Carls- [42] L.Zheng,L.Shen,L.Tian,S.Wang,J.Wang,andQ.Tian. son. Cnnfeaturesoff-the-shelf: anastoundingbaselinefor Scalablepersonre-identification:Abenchmark.InProceed- recognition.InProceedingsoftheIEEEConferenceonCom- ingsoftheIEEEInternationalConferenceonComputerVi- puterVisionandPatternRecognitionWorkshops,pages806– sion,pages1116–1124,2015. 1,3,5,8 813,2014. 6 [43] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re- [29] J.J.Tompson,A.Jain,Y.LeCun,andC.Bregler.Jointtrain- identification: Past, present and future. arXiv preprint ing of a convolutional network and a graphical model for arXiv:1610.02984,2016. 2,3,6 humanposeestimation. InAdvancesinneuralinformation [44] L. Zheng, H. Zhang, S. Sun, M. Chandraker, and Q. Tian. processingsystems,pages1799–1807,2014. 2 Person re-identification in the wild. arXiv preprint [30] A.ToshevandC.Szegedy. 
Deeppose: Humanposeestima- arXiv:1604.02531,2016. 2 tionviadeepneuralnetworks. InProceedingsoftheIEEE Conference on Computer Vision and Pattern Recognition, pages1653–1660,2014. 2 [31] R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re- identification. InEuropeanConferenceonComputerVision, 2016. 2,7,8 [32] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siameselongshort-termmemoryarchitectureforhumanre- identification. InEuropeanConferenceonComputerVision, 2016. 2,8 [33] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learningofsingle-imageandcross-imagerepresentationsfor person re-identification. In Proceedings of the IEEE Con- ferenceonComputerVisionandPatternRecognition,2016. 8 [34] S.-E.Wei,V.Ramakrishna,T.Kanade,andY.Sheikh. Con- volutionalposemachines.arXivpreprintarXiv:1602.00134, 2016. 2,3,8 [35] C.Weinrich,M.Volkhardt,andH.-M.Gross. Appearance- based 3d upper-body pose estimation and person re- identificationonmobilerobots. In2013IEEEInternational ConferenceonSystems,Man,andCybernetics,pages4384– 4390.IEEE,2013. 2 [36] S. Wu, Y.-C. Chen, X. Li, A.-C. Wu, J.-J. You, and W.-S. Zheng. Anenhanceddeepfeaturerepresentationforperson re-identification.In2016IEEEWinterConferenceonAppli- cationsofComputerVision(WACV),pages1–8.IEEE,2016. 2,8 [37] T.Xiao,H.Li,W.Ouyang,andX.Wang.Learningdeepfea- turerepresentationswithdomainguideddropoutforperson re-identification. InProceedingsoftheIEEEConferenceon ComputerVisionandPatternRecognition,2016. 2 [38] S.Yan,D.Xu,B.Zhang,H.-J.Zhang,Q.Yang,andS.Lin. Graphembeddingandextensions: ageneralframeworkfor dimensionality reduction. IEEE transactions on pattern analysisandmachineintelligence,29(1):40–51,2007. 8 [39] D.Yi,Z.Lei,S.Liao,S.Z.Li,etal. Deepmetriclearning for person re-identification. In ICPR, volume 2014, pages 34–39,2014. 2 [40] L.Zhang,T.Xiang,andS.Gong. Learningadiscriminative nullspaceforpersonre-identification. 
InProceedingsofthe IEEEConferenceonComputerVisionandPatternRecogni- tion,2016. 8 [41] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. Mars: A video benchmark for large-scale person re-identification. InEuropeanConferenceonComputerVi- sion,pages868–884.Springer,2016. 2 10
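Supplementary note on the evaluation metrics. The rank-k accuracies and mAP values in the comparison tables are standard retrieval measures: each query descriptor is compared with all gallery descriptors, the gallery is sorted by distance, and the position of the correct matches is scored. The following is a minimal sketch in Python of how such numbers can be computed, assuming Euclidean distance (as used for PIE on VIPeR) and assuming every query has at least one ground truth in the gallery. It is not the authors' evaluation code: the function names `euclidean` and `evaluate_retrieval` are hypothetical, and any dataset-specific filtering of gallery images is omitted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def evaluate_retrieval(query_feats, query_ids, gallery_feats, gallery_ids, ks=(1, 5)):
    """Return (rank-k accuracy per k, mAP) for a set of queries.

    Hypothetical helper for illustration only. Assumes every query
    identity appears at least once in the gallery.
    """
    hits = {k: 0 for k in ks}
    ap_sum = 0.0
    for q_feat, q_id in zip(query_feats, query_ids):
        # Sort gallery indices by increasing distance to the query.
        order = sorted(
            range(len(gallery_feats)),
            key=lambda i: euclidean(q_feat, gallery_feats[i]),
        )
        matches = [gallery_ids[i] == q_id for i in order]

        # Rank-k: the query counts as a hit if a true match is in the top k.
        first_hit = matches.index(True)
        for k in ks:
            if first_hit < k:
                hits[k] += 1

        # Average precision: mean of the precision at each true match.
        found = 0
        precision_sum = 0.0
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precision_sum += found / rank
        ap_sum += precision_sum / found

    n = len(query_feats)
    return {k: hits[k] / n for k in ks}, ap_sum / n


if __name__ == "__main__":
    # Toy 2-D descriptors; real descriptors would be high-dimensional.
    q_feats = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.1]]
    q_ids = [1, 2, 2]
    g_feats = [[0.1, 0.0], [0.9, 1.0], [0.0, 0.2], [2.0, 2.0]]
    g_ids = [1, 2, 2, 1]
    rank_acc, mean_ap = evaluate_retrieval(q_feats, q_ids, g_feats, g_ids, ks=(1, 2))
    print(rank_acc, mean_ap)
```

Rank-1 only rewards the single best match, whereas mAP also accounts for how early the remaining true matches appear, which is why the two measures can rank methods differently.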

