FEATURE SAMPLING STRATEGIES FOR ACTION RECOGNITION

Youjie Zhou, Hongkai Yu and Song Wang
Department of Computer Science and Engineering, University of South Carolina
[email protected], [email protected], [email protected]

arXiv:1501.06993v1 [cs.CV] 28 Jan 2015

ABSTRACT

Although dense local spatial-temporal features with bag-of-features representation achieve state-of-the-art performance for action recognition, the huge feature number and feature size prevent current methods from scaling up to real-size problems. In this work, we investigate different types of feature sampling strategies for action recognition, namely dense sampling, uniformly random sampling and selective sampling. We propose two effective selective sampling methods using object proposal techniques. Experiments conducted on a large video dataset show that we are able to achieve better average recognition accuracy using 25% fewer features, through one of the proposed selective sampling methods, and even maintain comparable accuracy while discarding 70% of the features.

Index Terms — Action recognition, Video analysis, Feature sampling

Fig. 1. Different feature sampling methods for action recognition. Panels, from left to right: video frame, dense sampling, random sampling, selective sampling.

1. INTRODUCTION

Given the popularity of social media, it has become much easier to collect a large number of videos from the Internet for human action recognition. Effective video representation is required for recognizing human actions and understanding video content in such rapidly growing unstructured data.

By far, the most commonly used video representation for action recognition has been the bag-of-words (BoW) model [1]. The basic idea is to summarize/encode the local spatial-temporal features of a video as a single vector. Among local features, dense trajectories (DT) [2] and their improved variant (iDT) [3] provide state-of-the-art results on most action datasets [3]. The main idea is to construct trajectories by tracking densely sampled feature points across frames, and to compute multiple descriptors along the trajectories.

Despite their success, DT and iDT can produce a huge number of local features, e.g., for a low-resolution 320×204 video with 175 frames, they can generate ∼52 MB of features [4]. It is difficult to store and manipulate such dense features for large datasets with thousands of high-resolution videos, especially for real-time applications.

Existing work focuses on reducing the total number of trajectory features through uniformly random sampling, at the cost of a minor reduction in recognition accuracy. [5] proposed a part model by which they are able to randomly sample features at lower image scales in an efficient way. [6] interpolated trajectories using uniformly distributed nearby feature points. [4] investigated the influence of random sampling on recognition accuracy on several large-scale datasets. Intuitively, however, features extracted around informative regions, such as human arms in hand waving, should be more useful for action classification than features extracted on the background. [7, 8] proposed selective sampling strategies on dense trajectory features based on saliency maps produced by modeling human eye movement when viewing videos. They are able to achieve better recognition results with selectively sampled features. However, it is impractical to obtain eye movement data for large datasets.

In this work, we investigate several feature sampling strategies for action recognition, as illustrated in Fig. 1, and propose two data-driven selective feature sampling methods. Inspired by the success of applying object proposal techniques to efficient saliency detection [9], we construct saliency maps using a recent object proposal method, EdgeBox [10, 11], and selectively sample dense trajectory features for action recognition. We further extend EdgeBox to produce proposals and construct saliency maps for objects with motion of interest, so that more effective features can be sampled for action classification.
We evaluate several feature sampling methods on a publicly available dataset, and show that the proposed motion object proposal based selective sampling method is able to achieve better accuracy using 25% fewer features than the full feature set.

The remainder of this paper is organized as follows: first we give a brief introduction to the DT/iDT features and the other components of our action classification framework; then three different feature sampling methods are described. Finally, we discuss experimental results on a large video dataset.

Fig. 2. Illustration of selective sampling methods via object proposal algorithms. Rows show different actions (brush hair, catch, jump, pick, swing baseball). From left to right: the original video frame, dense optical flow field, estimated object boundaries, top 5 scoring boxes generated by EdgeBox, saliency map constructed using EdgeBox proposals, estimated motion boundaries, top 5 scoring boxes generated by FusionEdgeBox, saliency map constructed using FusionEdgeBox.

2. DENSE TRAJECTORY FEATURES

The DT algorithm [2] represents a video by dense trajectories, together with appearance and motion features extracted around the trajectories. On each video frame, feature points are densely sampled on a grid with a spacing of 5 pixels, at 8 spatial scales spaced by a factor of 1/√2, as illustrated in the second column of Fig. 1. Trajectories are then constructed by tracking the feature points through the video based on dense optical flow [12]. The default length of a trajectory is 15, i.e., feature points are tracked over 15 consecutive frames. The iDT algorithm [3] further enhances the trajectory construction by eliminating background motion caused by camera movement.

For each trajectory, 5 types of descriptors are extracted: 1) the shape of the trajectory encodes local motion patterns and is described by a sequence of displacement vectors in both the x- and y-directions; 2) HOG, the histogram of oriented gradients [13], captures appearance information and is computed in a 32×32×15 spatio-temporal volume surrounding the trajectory; 3) HOF, the histogram of optical flow [14], focuses on local motion information and is computed in the same spatio-temporal volume as HOG; 4+5) MBHx and MBHy, motion boundary histograms [14], are computed separately for the horizontal and vertical gradients of the optical flow. HOG, HOF and MBH are all normalized appropriately.

To encode the descriptors/features, we use the Fisher vector [15] as in [3]. For each feature, we first reduce its dimensionality by a factor of two using Principal Component Analysis (PCA). Then a codebook of size 256 is formed by the Gaussian Mixture Model (GMM) algorithm on a random selection of 256,000 features from the training set. To combine different types of features, we simply concatenate their l2-normalized Fisher vectors.

For classification, we apply a linear SVM provided by LIBSVM [16], and a one-versus-rest approach is used for multi-class classification. In all experiments, we fix C = 100 in the SVM, as suggested in [3].
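To make this encoding pipeline concrete, the following is a minimal Python sketch of it. This is not the authors' implementation: the simplified Fisher vector (gradients with respect to the GMM means and variances only), the diagonal-covariance GMM, and the use of scikit-learn's LinearSVC in place of LIBSVM are our own assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fit_codebook(train_descs, n_words=256, n_sample=256000):
    """Halve descriptor dimensionality with PCA, then fit a 256-word
    diagonal-covariance GMM on a random subset of training descriptors."""
    pca = PCA(n_components=train_descs.shape[1] // 2).fit(train_descs)
    idx = np.random.choice(len(train_descs),
                           min(n_sample, len(train_descs)), replace=False)
    gmm = GaussianMixture(n_components=n_words, covariance_type="diag")
    gmm.fit(pca.transform(train_descs[idx]))
    return pca, gmm

def fisher_vector(descs, pca, gmm):
    """Encode one video's descriptors of a single type as an
    l2-normalized Fisher vector (mean and variance gradients only)."""
    x = pca.transform(descs)                  # (N, D)
    q = gmm.predict_proba(x)                  # soft assignments, (N, K)
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (x[:, None, :] - mu[None]) / np.sqrt(var)[None]     # (N, K, D)
    g_mu = (q[..., None] * diff).sum(0) / (len(x) * np.sqrt(w)[:, None])
    g_var = ((q[..., None] * (diff ** 2 - 1)).sum(0)
             / (len(x) * np.sqrt(2 * w)[:, None]))
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    return fv / (np.linalg.norm(fv) + 1e-12)

# The Fisher vectors of the five descriptor types are concatenated per
# video, then classified one-vs-rest with a linear SVM and C = 100, e.g.:
#   clf = LinearSVC(C=100).fit(np.vstack(train_fvs), train_labels)
```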
3. FEATURE SAMPLING STRATEGIES

In the following, we describe three feature sampling methods that differ from using all trajectories and related features computed on dense grids as in the DT/iDT algorithms. All three methods derive a sampling probability for each trajectory feature, denoted by σ, which measures whether the feature will be sampled or not. For example, σ = 0.8 means we sample trajectory features with probability greater than or equal to 0.8 for action recognition.

3.1. Uniformly Random Sampling

Following previous work [5, 4], we simply sample dense trajectory features in a random and uniform way. The sampling probability σ is the same for every trajectory. In experiments, we randomly sample 80%, 60%, 40% and 30% of the trajectory features, and report the corresponding action recognition accuracies.

3.2. Selective Sampling via Object Proposal

EdgeBox [10] is one of the most efficient object proposal algorithms published recently [11]. We utilize it to construct a saliency map on each video frame, and sample trajectory features according to the computed saliency values.

In EdgeBox, given a video frame, object boundaries are estimated via structured decision forests [17], and object contours are formed by grouping detected boundaries with similar orientations. To determine how likely a bounding box is to contain objects of interest, a simple but effective objectness score s_obj was proposed, based on the number of contours that are wholly enclosed by the box. We allow at most 10,000 boxes of different sizes and aspect ratios to be examined per frame. Fig. 2 illustrates the estimated object boundaries and the top 5 scoring boxes generated by EdgeBox in the third and fourth columns, respectively.

Given thousands of object proposal boxes on a video frame, we construct a saliency map through a pixel voting procedure. Each object proposal box is considered a vote for all pixels located inside it. We normalize all pixel votes into [0, 1] to form a saliency probability distribution. Saliency map examples are illustrated in the fifth column of Fig. 2; warmer colors indicate higher saliency probabilities.

Based on the constructed saliency maps of a video, we are able to selectively sample trajectories and related features. If the saliency probability at the starting pixel of a trajectory is higher than a predefined sampling probability σ, the trajectory and its related features are sampled. In experiments, we report recognition accuracies for σ of 0.2, 0.4 and 0.6, respectively.
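The voting procedure and the σ-thresholded sampling rule can be sketched in a few lines of Python. The box and trajectory formats here are our own assumptions, and the proposal boxes themselves are taken as given from an EdgeBox implementation.

```python
import numpy as np

def saliency_from_boxes(boxes, height, width):
    """Accumulate one vote per proposal box for every pixel inside it,
    then normalize the votes into [0, 1]."""
    votes = np.zeros((height, width), dtype=np.float64)
    for x1, y1, x2, y2 in boxes:        # up to ~10,000 boxes per frame
        votes[y1:y2, x1:x2] += 1.0
    return votes / max(votes.max(), 1e-12)

def sample_trajectories(trajs, saliency_maps, sigma=0.4):
    """Keep a trajectory if the saliency at its starting pixel is >= sigma.
    Each trajectory is assumed to expose (frame, x, y) of its start point."""
    return [t for t in trajs
            if saliency_maps[t.frame][int(t.y), int(t.x)] >= sigma]
```

With many boxes per frame, the per-box loop can equivalently be replaced by incrementing the four box corners and taking a 2-D cumulative sum (an integral-image trick), which yields the same vote map.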
3.3. Selective Sampling via Motion Object Proposal

Although stacking the boxes generated by EdgeBox can highlight the regions of a frame that contain salient objects, the constructed saliency map may not be suitable for sampling features for action recognition. For example, in the last row of Fig. 2, the optical flow field (second column) clearly indicates that the region with motion of interest for action recognition is located around the actor's head and arms, while the top scoring boxes and the saliency map constructed via EdgeBox incorrectly focus on the actor's legs. Thus, in order to incorporate motion information, we propose a motion object proposal method, named FusionEdgeBox, in which a fused objectness score is measured on both object boundaries and motion boundaries. The fusion score function is defined as

    s_fusion = α s_obj + β s_motion,    (1)

where s_obj is the original EdgeBox score, s_motion is the proposed motion objectness score, and α and β are balance parameters. We empirically fix α = β = 1 for all experiments. s_motion is defined similarly to s_obj, i.e., based on the number of wholly enclosed contours in a box. However, s_motion utilizes contours that are grouped from motion boundaries, which are estimated as the image gradients of the optical flow field. Motion boundary examples are shown in the sixth column of Fig. 2.

By applying the fusion score in the EdgeBox framework, we are able to generate a set of proposal boxes and construct a saliency map for feature sampling as well. Examples of the top 5 scoring fusion boxes and the constructed saliency maps are illustrated in the last two columns of Fig. 2, respectively. Compared with the examples generated by the original EdgeBox (shown in columns 3-5), we can see that FusionEdgeBox better explores regions with motion of interest, which is useful for action feature sampling (verified by the experiments below). As in Section 3.2, we report recognition accuracies using the sampled trajectory features for σ of 0.2, 0.4 and 0.6, respectively.
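A minimal sketch of the fused score of Eq. (1) is given below, under the simplifying assumption that each contour (whether grouped from object boundaries or from motion boundaries) is summarized by its bounding box; counting wholly enclosed contours is a simplification of the full EdgeBox scoring.

```python
ALPHA, BETA = 1.0, 1.0   # balance parameters; the paper fixes both to 1

def enclosed_count(box, contour_boxes):
    """Count contours whose bounding boxes lie wholly inside `box`."""
    x1, y1, x2, y2 = box
    return sum(cx1 >= x1 and cy1 >= y1 and cx2 <= x2 and cy2 <= y2
               for cx1, cy1, cx2, cy2 in contour_boxes)

def fusion_score(box, edge_contours, motion_contours):
    """s_fusion = alpha * s_obj + beta * s_motion  (Eq. 1)."""
    s_obj = enclosed_count(box, edge_contours)        # from object boundaries
    s_motion = enclosed_count(box, motion_contours)   # from flow-gradient boundaries
    return ALPHA * s_obj + BETA * s_motion
```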
4. EXPERIMENTS

We have conducted experiments on a publicly available video dataset, J-HMDB [18], which consists of 920 videos of 21 different actions. These videos are selected from the larger HMDB dataset [19]. J-HMDB also provides annotated bounding boxes for the actors on each frame. We report the average classification accuracy over the three training/testing split settings provided by J-HMDB.

In the following, we evaluate action recognition on J-HMDB using trajectory features sampled by the different methods, and discuss their performance. We also compare the obtained accuracies with a few state-of-the-art action recognition algorithms.

4.1. Influence of Sampling Strategies

In addition to the three feature sampling methods introduced above, to better understand the trajectory features we investigate a fourth sampling method that uses the annotated bounding boxes of the actors: we sample a trajectory's features if the starting point of the trajectory is located inside an annotated box. A similar strategy was proposed in [18], and we refer to it as GT.

Figures 3 and 4 plot the average classification accuracy over all classes for all sampling methods under different sampling rates, using the DT feature and the iDT feature, respectively. In general, through feature sampling we are able to achieve higher performance than directly using all features, since noisy background features have been discarded.

Fig. 3. Average accuracies using the DT feature, plotted against the number of features per video (500-4500) for FusionEdgeBox, EdgeBox, Random, GT and DT.

Fig. 4. Average accuracies using the iDT feature, plotted against the number of features per video (500-3500) for FusionEdgeBox, EdgeBox, Random, GT and iDT.

Specifically, for the DT feature we can see that: 1) trajectory features sampled inside the annotated bounding boxes achieve higher accuracy than using all features. A similar phenomenon was observed in [18], which indicates that DT features located around the human body are more important than features extracted in other regions. 2) Selective sampling methods achieve higher accuracies than random sampling given a similar number of sampled features. This shows that sampling DT features from certain regions is important for action recognition, and that the object proposal based strategies are able to detect these regions. 3) The proposed selective sampling via motion object proposals outperforms the other sampling methods, even the one based on annotated bounding boxes. This verifies that the proposed FusionEdgeBox method is useful for discovering regions of interest for action recognition.

For the iDT feature, however, the different sampling methods result in similar accuracies. Random sampling outperforms the others slightly, especially when the number of sampled features is small. The reason may be that, by eliminating the background motion caused by camera movement, the iDT feature is more compact and meaningful than the DT feature; e.g., the average number of iDT features per video is much lower than that of the DT feature. Random sampling is able to better preserve the original iDT feature distribution than the selective sampling methods, which have a quite large sampling bias.

4.2. Comparisons to state-of-the-arts

Table 1 compares the feature sampling methods at different sampling rates with the state-of-the-arts. The sampling methods achieve better average accuracies than several state-of-the-art methods using the same classification pipeline, with ∼20% fewer features. It is interesting to observe that, even when discarding more than 70% of the features, random sampling and the proposed selective sampling are still able to maintain comparable performance.

Table 1. Comparison to state-of-the-arts in terms of average accuracy and feature size. *It leverages an advanced feature encoding technique, stacked Fisher vectors.

    Method                            J-HMDB     Memory (GB)
    Dense Trajectory [2]              62.88%     5.4
    Improved Dense Trajectory [3]     64.52%     4.2
    Peng et al. [20] w/ iDT           69.03%*    4.2
    Gkioxari et al. [21]              62.5%      -
    Discard 20%∼25% features:
    DT   Random                       62.33%     4.3
    DT   EdgeBox                      65.33%     4.5
    DT   FusionEdgeBox                65.91%     4.0
    iDT  Random                       65.49%     3.4
    iDT  EdgeBox                      65.32%     3.6
    iDT  FusionEdgeBox                65.11%     3.5
    Discard 70%∼80% features:
    DT   Random                       59.90%     1.1
    DT   EdgeBox                      58.51%     1.4
    DT   FusionEdgeBox                60.71%     1.4
    iDT  Random                       62.34%     1.3
    iDT  EdgeBox                      58.85%     1.2
    iDT  FusionEdgeBox                60.87%     1.3

5. CONCLUSIONS

In this work, we focus on feature sampling strategies for action recognition in videos. Dense trajectory features are utilized to represent videos. Two types of sampling strategies are investigated, namely uniformly random sampling and selective sampling. We propose to use object proposal techniques to construct saliency maps for video frames, and to use them to guide the selective feature sampling process. We also propose a motion object proposal method that incorporates object motion information into the object proposal framework. Experiments conducted on a large video dataset indicate that the sampling based methods are able to achieve better recognition accuracy using 25% fewer features, through one of the proposed selective feature sampling methods, and even maintain comparable accuracy while discarding 70% of the features.

6. REFERENCES
[1] Xiaojiang Peng, Limin Wang, Xingxing Wang, and Yu Qiao, "Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice," CoRR, vol. abs/1405.4506, 2014.
[2] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu, "Action recognition by dense trajectories," in CVPR, 2011.
[3] Heng Wang and Cordelia Schmid, "Action recognition with improved trajectories," in ICCV, 2013.
[4] Michael Sapienza, Fabio Cuzzolin, and Philip H. S. Torr, "Feature sampling and partitioning for visual vocabulary generation on large action classification datasets," CoRR, vol. abs/1405.7545, 2014.
[5] Feng Shi, Emil Petriu, and Robert Laganiere, "Sampling strategies for real-time action recognition," in CVPR, 2013.
[6] Vadim Kantorov and Ivan Laptev, "Efficient feature extraction, encoding and classification for action recognition," in CVPR, 2014.
[7] Stefan Mathe and Cristian Sminchisescu, "Dynamic eye movement datasets and learnt saliency models for visual action recognition," in ECCV, 2012.
[8] Eleonora Vig, Michael Dorr, and David Cox, "Space-variant descriptor sampling for action recognition based on saliency and eye movements," in ECCV, 2012.
[9] Hongliang Li, Fanman Meng, and King Ngi Ngan, "Co-salient object detection from multiple images," IEEE Transactions on Multimedia, vol. 15, pp. 1896-1909, 2013.
[10] C. Lawrence Zitnick and Piotr Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014.
[11] Jan Hosang, Rodrigo Benenson, and Bernt Schiele, "How good are detection proposals, really?," in BMVC, 2014.
[12] Gunnar Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Scandinavian Conference on Image Analysis, 2003.
[13] Navneet Dalal and Bill Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[14] Navneet Dalal, Bill Triggs, and Cordelia Schmid, "Human detection using oriented histograms of flow and appearance," in ECCV, 2006.
[15] Florent Perronnin, Jorge Sánchez, and Thomas Mensink, "Improving the fisher kernel for large-scale image classification," in ECCV, 2010.
[16] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27, 2011.
[17] Piotr Dollár and C. Lawrence Zitnick, "Structured forests for fast edge detection," in ICCV, 2013.
[18] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black, "Towards understanding action recognition," in ICCV, 2013.
[19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in ICCV, 2011.
[20] Xiaojiang Peng, Changqing Zou, Yu Qiao, and Qiang Peng, "Action recognition with stacked fisher vectors," in ECCV, 2014.
[21] Georgia Gkioxari and Jitendra Malik, "Finding action tubes," CoRR, vol. abs/1411.6031, 2014.
