Ordered Pooling of Optical Flow Sequences for Action Recognition

Jue Wang (1,3), Anoop Cherian (2,3), Fatih Porikli (1,2,3)
1 Data61/CSIRO, 2 Australian Centre for Robotic Vision, 3 Australian National University, Canberra, Australia

Abstract

Training of Convolutional Neural Networks (CNNs) on long video sequences is computationally expensive due to the substantial memory requirements and the massive number of parameters that deep architectures demand. Early fusion of video frames is thus a standard technique, in which several consecutive frames are first agglomerated into a compact representation and then fed into the CNN as an input sample. For this purpose, a summarization approach that represents a set of consecutive RGB frames by a single dynamic image, capturing pixel dynamics, was proposed recently. In this paper, we introduce a novel ordered representation of consecutive optical flow frames as an alternative and argue that this representation captures the action dynamics more effectively than RGB frames. We provide intuitions on why such a representation is better for action recognition. We validate our claims on standard benchmark datasets and demonstrate that using summaries of flow images leads to significant improvements over RGB frames, while achieving accuracy comparable to the state of the art on the UCF101 and HMDB datasets.

Figure 1. Examples of dynamic flow images and related RGB video frames. From left to right and top to bottom, the action classes are: "brush hair", "cartwheel", "pull up", "sit up", "draw sword", and "clip". Each dynamic flow image is a two-channel image that compactly summarizes a sequence of optical flow frames from a video sequence.

1. Introduction

Automatically recognizing human actions in videos is a challenging task: actions in the real world are often very complex, may involve hard-to-detect objects and tools, may unfold at different temporal speeds, can be contaminated by action clutter, and the same action can vary significantly from one actor to another, among several other factors. Nevertheless, developing efficient solutions to this problem could facilitate and nurture many applications, including visual surveillance, augmented reality, video summarization, and human-robot interaction. The recent resurgence of deep learning has demonstrated significant promise in ameliorating the difficulties in recognizing actions. However, such solutions are still far from being practically useful, and thus this problem continues to be a popular research topic in computer vision [1, 7, 21, 23, 28, 29].

Usually, deep architectures for video-based action recognition take as input short video clips consisting of one to a few tens of frames. Using longer subsequences would require deeper networks or involve a huge number of parameters, which might not fit in GPU memory, or may be problematic to train due to computational complexity. This restriction, and the resulting clipping of the temporal receptive fields of the videos to short durations, prohibits CNNs from learning the long-term temporal evolution of the actions, which is very important for recognition, especially when the actions are complex.

One standard way to tackle this difficulty in capturing long-term dependencies is to use temporal pooling, which can be applied either when providing the input to the CNN or after extracting features from intermediate CNN layers [4].
In this paper, we explore a recently introduced early fusion scheme based on a Ranking SVM [10] formulation, in which several consecutive RGB frames of a video are fused to generate one "dynamic image" by minimizing a quadratic objective with temporal order constraints. The main idea of this scheme is that the parameters learned after solving this formulation capture the temporal ordering of the pixels from frame to frame, thus summarizing the underlying action dynamics.

One drawback of the rank pooling approach is that the formulation does not directly capture the motion associated with the action; it only captures how the pixel RGB values change from frame to frame. Typically, video pixels can be noisy. Given that the pooling constraints in this scheme only look at increasing pixel intensities from frame to frame, it can fit to noise pixels that adhere to this order yet are unrelated to the action. To mitigate these issues, in this paper we apply the rank pooling formulation to optical flow images instead of RGB frames.

It is clear that optical flow can easily circumvent the above problems associated with sequences of RGB frames. The flow by itself captures the objects in motion in the videos, thereby capturing the action dynamics directly, while thresholding the flow vectors helps to avoid noise. Thus, we posit that summarizing sequences of optical flow can be significantly more beneficial than using RGB frames for action recognition. By solving the rank-SVM formulation (Section 3), we generate flow images that summarize the action dynamics, dubbed dynamic flow images. These images are then used as input to a standard action recognition CNN architecture, which is trained on the actions in the sequences. In Figure 1, we show a few sample dynamic flow images generated using the proposed technique for the respective RGB frames.

A natural question here concerns the intuitive benefit of such a flow summarization scheme, given that standard action recognition CNN frameworks already use a stack of flow frames. Note that such flow stacks usually contain only a few frames (typically 10), while a dynamic flow image summarizes several tens of flow frames, thereby capturing long-term temporal context.

To validate our claims, we provide extensive experimental comparisons (Section 4) on two standard benchmark action recognition datasets, namely (i) the HMDB-51 dataset and (ii) the UCF-101 dataset. Our results show that using dynamic flow images leads to significant improvements in action recognition performance: a 4% improvement on HMDB-51 [16] and 6% on UCF-101 [25] in comparison to using dynamic RGB images, without combining any other methods.

Before moving on to explaining our scheme in detail, we summarize below our important contributions.

1. We provide an efficient and powerful video representation, the dynamic flow image, generated based on the Ranking SVM formulation. Our representation aggregates local action dynamics (as captured by optical flow) over subsequences while preserving the temporal evolution of these dynamics.

2. We provide an action recognition framework based on the popular two-stream network [23] that also combines dynamic flow images, RGB frames (for action context), and hand-crafted trajectory features.

3. We provide extensive experimental comparisons demonstrating the effectiveness of our scheme on two challenging benchmark datasets.

2. Related Work

Extracting discriminative features from videos is the first and most important step in action recognition. Such features can be roughly divided into two groups: handcrafted local features and deep learned features. Of course, there are also many interesting methods that do not fall into either group, such as using genetic programming to learn spatio-temporal features automatically [19] or using the shape evolution of human skeletons [2]. However, we limit our survey to the techniques most related to the approach presented in this paper.

Handcrafted Local Features. Using local features to generate representations for action recognition is popular because they make the recognition robust to noise, background motion, or illumination changes. One noteworthy such representation is described in [28], where dense trajectories, which capture the dynamics of actions, are used to define regions of interest in the video. Local features (such as HOG, HOF, MBH, etc.) that directly relate to the action can then be extracted from these regions to train classifiers.
There have been extensions of this approach that also include semantics of the action, such as using Fisher vectors [22, 31], using the bag-of-visual-words model [20], combining with depth data [18], or using human pose [14]. However, the recent trend is towards automatically learning useful features in a data-driven way via convolutional neural networks [4, 7, 29], and we follow this trend in this paper.

Deep Learned Features. Deep learning methods have been used extensively for computer vision tasks since the work of Krizhevsky et al. [15] on object recognition, and such models have recently been extended to action recognition. In [23], a two-stream CNN model is proposed for this task with very promising results. The two-stream model is enhanced using a VGG network in [7], where intermediate fusion of the convolutional layers is also incorporated. A trajectory-pooled CNN setup is described in [29] that fuses CNN features along action trajectories. Early and late fusion of CNN feature maps is proposed in [13, 32]. Almost all of these methods combine the feature maps without accounting for the temporal evolution of these features; as a result, they may fail to capture the temporal action dynamics effectively.

To account for this shortcoming, Fernando et al. [10] proposed a pooling formulation on video features that respects the temporal order of frames. This formulation is extended to RGB frames in [4, 8]. We base this paper on that extension, but deviate from their approach in that we use optical flow frames instead of RGB frames for capturing the action dynamics. To the best of our knowledge, no other work has analyzed the advantages of using optical flow images in the rank-SVM formulation for action recognition. We showcase extensive experiments to substantiate the benefits over using RGB frames.

We also note that several other deep learning models have been devised for action modeling, such as recurrent neural networks [3, 6] and long short-term memory networks [5, 17, 32]. While we acknowledge that recurrent architectures may benefit action recognition, datasets for this task are often small or noisy, so training such models is challenging. Thus, in this paper we focus on advancing action representations for effective CNN training.

3. Proposed Method

In this section, we introduce our dynamic flow images; we describe the necessary formulations to generate these images, followed by an exposition of a multi-stream CNN framework for action recognition. Note that our dynamic flow formulation is inspired by the recently introduced dynamic image framework [4]; however, our contribution is to show the shortcomings of that formulation and propose remedies.

3.1. Formulation

Let us assume that we are provided with a sequence of n consecutive optical flow images F = [f_1, f_2, ..., f_n], where each f_i ∈ R^{d_1 × d_2 × 2}, and d_1, d_2 are the height and width of the image. We assume a two-channel flow image corresponding to the horizontal and vertical components of the flow vector, which we denote by f_i^u and f_i^v respectively. Our goal is to generate a single "dynamic flow" image F ∈ R^{d_1 × d_2 × 2} that captures the temporal order of the flow images in F. We use the following formulation to obtain this representation:

    \min_{F \in \mathbb{R}^{d_1 \times d_2 \times 2},\ \xi \ge 0} \ \|F\|^2 + C \sum_{i<j} \xi_{ij}    (1)
    \text{s.t.} \quad \langle F, f_i \rangle \le \langle F, f_j \rangle - 1 + \xi_{ij}, \quad \forall\, i < j,

where ⟨·,·⟩ denotes an inner product between the original flow vectors and the dynamic flow image to be found. As one may recall, this formulation is similar to the Ranking-SVM formulation [12], where the inner product captures the ranking order, which in our case corresponds to the temporal order of the frames in the sequence. While, ideally, we would enforce that the projection (via the inner product) of the flow frames onto the dynamic flow image increases by at least one from frame to frame, we relax this constraint using the slack variables ξ, as is usually done in a max-margin framework.
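To make the connection to the Ranking SVM [12] explicit (a remark we add here for clarity, not part of the original derivation), each order constraint in (1) can be rewritten in terms of pairwise difference vectors:

    \langle F, f_i \rangle \le \langle F, f_j \rangle - 1 + \xi_{ij} \;\Longleftrightarrow\; \langle F, f_j - f_i \rangle \ge 1 - \xi_{ij}, \quad \forall\, i < j,

so F plays the role of a linear SVM weight vector trained on the difference examples f_j - f_i (all carrying the same label). This is what allows off-the-shelf SVM solvers, such as the libSVM package used later in this paper, to be applied to compute the dynamic flow image.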
While the above formulation is a direct adaptation of the Ranking SVM, there are a few subtleties that need to be taken care of when using it for optical flow images. We discuss these issues next.

It is often observed that a direct use of raw optical flow in (1) is inadequate. Computing optical flow is a computationally expensive and difficult task, and flow solvers are often prone to local minima, leading to inaccurate flow estimates. In order to circumvent such issues, we instead apply the algorithm to flow images computed as running averages; such a scheme averages out the noise in the flow estimates. For the flow set F, we represent the resulting smoothed flow image for frame f_t as:

    \hat{f}_t = \frac{1}{t} \sum_{i=1}^{t} f_i.    (2)

We compute the dynamic flow on such smoothed flow images. Note that, since flow is signed, such averaging will cancel white noise.

Another issue specific to optical flow images is that the two flow channels u and v are coupled. They are no longer color channels as in an RGB image; instead they represent velocity components, which together capture the motion vector at a pixel. However, our ranking formulation assumes independence of the channels. One way to avoid this difficulty is to decorrelate the channels via diagonalization, i.e., to use the singular vectors associated with a short set of flow images as representatives of the original flow images. However, our experiments showed that this did not lead to significant improvements. Thus, we choose to ignore this coupling in the current paper and assume each channel is independent when creating the dynamic flow images. Our experiments (in Section 4) show that this assumption is not detrimental.

Incorporating the above assumptions into the objective, and assuming F_u, F_v ∈ R^{d_1 × d_2} are the two dynamic flow channels corresponding to the horizontal and vertical flow directions, we can rewrite (1) as:

    \min_{F_u, F_v \in \mathbb{R}^{d_1 \times d_2},\ \xi \ge 0} \ \|F_u\|^2 + \|F_v\|^2 + C \sum_{i<j} \xi_{ij}    (3)
    \text{s.t.} \quad \langle F_u, \hat{f}_i^u \rangle + \langle F_v, \hat{f}_i^v \rangle \le \langle F_u, \hat{f}_j^u \rangle + \langle F_v, \hat{f}_j^v \rangle - 1 + \xi_{ij}, \quad \forall\, i < j.

We solve the above optimization problem using the libSVM package, as described in [4] (an illustrative sketch is given after Figure 2). Once the two flow images F_u and F_v are created, they are stacked to generate a two-channel dynamic flow image, which is then fed to a multi-stream CNN for learning actions (as described in the next section). In Figure 2, we show several examples of dynamic flow images together with their respective RGB frames and dynamic RGB images. As is clear from these visualizations, the dynamic flow images summarize the actual action dynamics in the sequences, while the dynamic RGB images include averaged background pixels and thus may include dynamics unrelated to recognizing actions.

Figure 2. Qualitative comparisons between dynamic images [4] (second row) and our proposed dynamic flow image (third row) representation. An RGB frame from the respective video is also shown (first row). As can be seen from the visualizations, dynamic flow focuses on actions while dynamic images can be contaminated by background pixels.
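To make the construction concrete, the following is a minimal sketch, written by us rather than taken from the authors' implementation, of how a dynamic flow image could be computed from one sub-sequence of flow frames. It implements the running-average smoothing of (2) and the pairwise order constraints of (3), but optimizes the hinge-relaxed objective by plain subgradient descent instead of the libSVM solver used in the paper; names such as flow_stack, lr, and n_iters are illustrative.

import numpy as np

def dynamic_flow(flow_stack, C=1.0, lr=1e-4, n_iters=200):
    """Rank-pool a sub-sequence of optical flow frames into one
    two-channel dynamic flow image.

    flow_stack: array of shape (T, H, W, 2) holding the u/v channels
                of T consecutive (preprocessed) flow frames.
    Returns an array of shape (H, W, 2).
    """
    T, H, W, _ = flow_stack.shape

    # Eq. (2): running averages of the flow, which suppress estimation noise.
    counts = np.arange(1, T + 1, dtype=np.float64).reshape(-1, 1, 1, 1)
    smoothed = np.cumsum(flow_stack.astype(np.float64), axis=0) / counts

    # Flatten each smoothed frame; the sum <F_u, f_u> + <F_v, f_v> in Eq. (3)
    # then becomes a single dot product with the flattened image.
    X = smoothed.reshape(T, -1)
    F = np.zeros(X.shape[1])

    # All temporal-order pairs i < j.
    i_idx, j_idx = np.triu_indices(T, k=1)

    for _ in range(n_iters):
        scores = X @ F
        # Pairs violating the margin <F, x_j> - <F, x_i> >= 1.
        violated = (1.0 + scores[i_idx] - scores[j_idx]) > 0
        # Subgradient of ||F||^2 + C * sum of hinge terms.
        grad = 2.0 * F
        if np.any(violated):
            grad += C * (X[i_idx[violated]] - X[j_idx[violated]]).sum(axis=0)
        F -= lr * grad

    return F.reshape(H, W, 2)

In the setting of Section 3.3, flow_stack would hold 25 preprocessed TV-L1 flow frames of size 224x224x2, and the two slices of the returned array correspond to F_u and F_v.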
3.2. Three-Stream Prediction Framework

Next, we propose a three-stream CNN setup for action recognition. This architecture is an extension of the popular two-stream model [23], which takes as input individual RGB frames in one stream and a small stack of optical flow frames (about 10 flow images) in the other. One shortcoming of that model is that it cannot see long-range action evolution, for which we propose to use our dynamic flow images (each of which summarizes about 25 flow frames in one dynamic flow image). As is clear, each such stream looks at complementary action cues. Our overall framework is illustrated in Figure 3.

To be precise, for the dynamic flow stream we generate multiple dynamic flow images per video sequence. To achieve this, we first split the input flow video into several sub-sequences, each of length w, generated at a temporal stride s. For each sub-sequence, we construct a dynamic flow image using the optical flow images in that window. We associate the same ground-truth action label with all the sub-sequences, thus effectively increasing the number of training videos by a factor of n/s, where n is the average number of frames in the sequences. Note that we use a separate CNN stream for the dynamic flow images. Given that action recognition datasets are usually tiny in comparison to image datasets (such as ImageNet), increasing the training set in this way is usually necessary for effective training of the network.

3.3. Practical Extensions

We use the TV-L1 optical flow algorithm [33] to generate the flow images, using its OpenCV implementation. For every flow image, we subtract the median flow, thus removing camera motion if any (assuming flow from the action dynamics occupies a small fraction of the image with respect to the background). The resulting flow values are then thresholded to the range [-20, 20] pixels, with every flow vector outside this range set to zero; this step removes unwanted flow vectors that might correspond to noise in the images. Next, we scale the flow to the discrete range [0, 255] and treat each flow channel as a gray-scale image (a code sketch of these steps follows Figure 3). A set of such transformed flow images is then used as input to the rank-SVM formulation in (3), thereby generating one dynamic flow image per sub-sequence. Using the libSVM package, it takes about 0.25 seconds to generate one dynamic flow image on a single-core CPU, combining 25 flow frames each of size 224x224x2.

Figure 3. Architecture of our dynamic-flow CNN based classification setup. Our three-stream CNN consists of an optical flow stream taking stacks of flow frames, an RGB stream taking single RGB frames, and our dynamic flow stream taking single images, each image summarizing the action dynamics over 25 flow frames.
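The preprocessing of Section 3.3 reduces to a few per-frame array operations followed by a sliding window. The sketch below is our own illustration, assuming raw TV-L1 flow frames (e.g., from OpenCV's contrib module) are already available as NumPy arrays; the helper names, and the exact mapping of values into [0, 255], are our assumptions rather than details taken from the paper's code.

import numpy as np

def preprocess_flow(flow, max_flow=20.0):
    """Normalize one raw flow frame of shape (H, W, 2) as in Section 3.3."""
    # Subtract the per-channel median flow to compensate for camera motion.
    flow = flow - np.median(flow, axis=(0, 1), keepdims=True)
    # Zero out displacements outside [-max_flow, max_flow]; these are mostly noise.
    flow = np.where(np.abs(flow) <= max_flow, flow, 0.0)
    # Map [-max_flow, max_flow] to the discrete range [0, 255],
    # so each channel can be stored as a gray-scale image.
    return np.round((flow + max_flow) * (255.0 / (2.0 * max_flow))).astype(np.uint8)

def flow_subsequences(flow_frames, window=25, stride=5):
    """Yield preprocessed sub-sequences of shape (window, H, W, 2)."""
    frames = [preprocess_flow(f).astype(np.float32) for f in flow_frames]
    for start in range(0, len(frames) - window + 1, stride):
        yield np.stack(frames[start:start + window])

Each yielded stack can be passed to a rank-pooling routine such as dynamic_flow above, producing one dynamic flow image per sub-sequence and, as described in Section 3.2, one training sample per window.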
4. Experiments

In this section, we first describe the datasets used in our experiments, followed by the implementation details of our framework and comparisons to prior works.

4.1. Datasets

We use two standard datasets for the task: (i) the HMDB51 dataset [16] and (ii) the UCF101 dataset [25]. For both datasets, the standard evaluation protocol is the average classification accuracy over three splits.

HMDB51 dataset [16] consists of 6766 video clips distributed over 51 action categories. The videos are downloaded from YouTube and are generally of low resolution.

UCF101 dataset [25] is a relatively larger dataset, containing 13,320 videos spanning 101 actions. The videos are mostly of sports activities and contain significant camera motion and variation in people's appearance.

4.2. Implementation Details

Training CNNs. In the experiments to follow, we use two successful CNN architectures, namely AlexNet [15] and VGG-16 [24]. We use the Caffe toolbox [11] for the implementation. As the number of training videos is too limited to train a standard deep network from scratch, we fine-tune the networks from models pre-trained for image recognition (on the ImageNet dataset). On the training subsets of each dataset, we fine-tune all the layers of the respective networks with an initial learning rate of 10^-4 for VGG-16 and 10^-3 for AlexNet, with a drop-out of 0.85 for 'fc7' and 0.9 for 'fc6', as recommended in [24, 7]. The drop-out is subsequently increased when the validation loss begins to increase. The networks are trained using SGD with a momentum of 0.9 and a weight decay of 0.0005. We use a mini-batch size of 64 for HMDB51 and 128 for UCF101.

Fusion Strategy. As alluded to earlier, we use a three-stream network with RGB, stacked optical flow, and dynamic flow images. The three networks are trained separately. During testing, the output of the fc6 layer of each stream is extracted (see Figure 3). These intermediate features are concatenated into a single vector, which is then fused via a linear SVM. We also experimented with features from the fc7 layer; however, we found them to perform slightly worse than fc6. As is typically done, we also extract dense trajectory features from the videos (such as HOG, HOF, and MBH) [27], which are encoded using Fisher vectors and concatenated with the CNN features before passing to the linear SVM.
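This fusion step is a late fusion over per-stream descriptors. Below is a minimal sketch under our own assumptions: the fc6 activations of the three streams (and, optionally, the IDT Fisher vectors) have already been extracted into arrays of shape (num_videos, feature_dim), scikit-learn's LinearSVC stands in for the linear SVM, and the per-stream L2 normalization is our choice rather than a detail stated in the paper.

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def late_fusion_accuracy(train_streams, y_train, test_streams, y_test, C=1.0):
    """Concatenate per-stream features and classify with a linear SVM.

    train_streams / test_streams: lists of arrays, one per stream
        (e.g., RGB fc6, stacked-flow fc6, dynamic-flow fc6, IDT-FV),
        each of shape (num_videos, feature_dim).
    """
    # L2-normalize each stream so no single descriptor dominates the concatenation.
    X_train = np.hstack([normalize(s) for s in train_streams])
    X_test = np.hstack([normalize(s) for s in test_streams])

    clf = LinearSVC(C=C)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)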
4.3. Experimental Results

We organize our experiments into several categories, namely (i) deciding the hyper-parameters of our formulation (e.g., the window size used to generate dynamic flow images), (ii) benefits of dynamic flow images against dynamic RGB images and other data modalities, (iii) influence of the CNN architecture, (iv) complementarity of the dynamic flow images to hand-crafted features, and (v) comparisons against the state of the art. Below, we detail each of these experiments.

Hyper-parameters: We tested the performance of dynamic flow images using different temporal window sizes, i.e., the number of consecutive flow frames used to generate one dynamic flow image. The results of this experiment on split 1 of the HMDB51 dataset are provided in Table 1. As is clear, increasing the window size is not beneficial, as it may lead to more action clutter. Motivated by this experiment, we use a window size of 25 at a temporal stride of 5 in all the experiments in the sequel.

Table 1. Influence of the number of consecutive flow frames (window size) used to construct one dynamic flow image. This experiment uses split 1 of the HMDB51 dataset with the VGG-16 CNN model.
  Dynamic flow window size | Accuracy
  15 | 48.23%
  25 | 48.75%
  30 | 46.60%

Benefits over Dynamic Images (DI): As discussed in Section 3, dynamic flow (DF) captures the action dynamics directly, in contrast to the dynamic RGB images [4]. To demonstrate this, we show experiments using the VGG-16 network on the HMDB51 dataset in Table 2 and using the AlexNet network on the UCF101 dataset in Table 3 (the AlexNet results on UCF101 are taken directly from [4]). We also show comparisons to RGB, a stack of flow frames, and various combinations of them. As is clear from the two tables, DF + RGB is about 9-10% better than DI + RGB, consistently on both datasets and both CNN architectures. Surprisingly, combining DI with DF + RGB leads to a reduction in performance of about 4% (Table 2). This, we suspect, is because DI includes pixel dynamics from the scene background that may be unrelated to the actions and may confuse the subsequent classifier; such noise is avoided by computing optical flow. We investigate this further in Table 4, where we provide the per-class accuracy on HMDB51 for DI+RGB, DF+RGB, and DI+DF+RGB. We find that 29 of the 51 actions in this dataset lose in performance when using DI+DF+RGB as against DF+RGB, which implies that DF+RGB is indeed a better representation for actions.

Table 2. Evaluation on HMDB51 split 1 using the VGG-16 model.
  Method | Accuracy
  Static RGB [7] | 47.06%
  Stacked Optical Flow [7] | 55.23%
  Dynamic Image [4] | 44.74%
  Dynamic Flow | 48.75%
  Dynamic Image + RGB | 47.96%
  Dynamic Flow + RGB | 58.30%
  Dynamic Image + Dynamic Flow + RGB | 54.86%
  (S) Optical Flow + RGB [7] | 58.17%
  Dynamic Image + RGB + (S) Optical Flow | 55.40%
  Dynamic Flow + RGB + (S) Optical Flow | 61.70%

Table 3. Evaluation on UCF101 using the AlexNet CNN model.
  Method | Accuracy
  On split 1:
  Static RGB [29] | 71.20%
  Stacked Optical Flow [29] | 80.10%
  Dynamic Flow | 75.36%
  Dynamic Flow + RGB | 84.93%
  (S) Optical Flow + RGB [29] | 84.70%
  Dynamic Flow + RGB + (S) Optical Flow | 88.63%
  Over three splits:
  Static RGB | 70.10%
  Dynamic Flow | 76.19%
  Dynamic Image [4] | 70.90%
  Dynamic Image + RGB [4] | 76.90%
  Dynamic Flow + RGB | 84.93%

Table 4. Per-class accuracy on HMDB51 split 1 using different methods.
  Action | DI+RGB | DF+RGB | DI+DF+RGB
  brush hair | 60% | 77% | 77%
  cartwheel | 3% | 23% | 13%
  catch | 27% | 43% | 43%
  chew | 43% | 60% | 60%
  clap | 38% | 52% | 52%
  climb stairs | 60% | 53% | 60%
  climb | 67% | 73% | 73%
  dive | 50% | 60% | 53%
  draw sword | 30% | 47% | 40%
  dribble | 73% | 90% | 80%
  drink | 20% | 43% | 43%
  eat | 33% | 50% | 37%
  fall floor | 34% | 24% | 38%
  fencing | 63% | 77% | 70%
  flic flac | 30% | 50% | 50%
  golf | 93% | 97% | 93%
  handstand | 50% | 70% | 60%
  hit | 14% | 25% | 29%
  hug | 53% | 67% | 57%
  jump | 31% | 52% | 41%
  kick ball | 37% | 40% | 33%
  kick | 7% | 20% | 17%
  kiss | 83% | 83% | 87%
  laugh | 50% | 70% | 57%
  pick | 30% | 40% | 37%
  pour | 83% | 97% | 93%
  pullup | 87% | 100% | 100%
  punch | 63% | 63% | 70%
  push | 70% | 87% | 77%
  pushup | 57% | 60% | 67%
  ride bike | 93% | 93% | 97%
  ride horse | 80% | 83% | 77%
  run | 32% | 50% | 39%
  shake hands | 70% | 73% | 77%
  shoot ball | 80% | 87% | 83%
  shoot bow | 87% | 93% | 87%
  shoot gun | 67% | 60% | 70%
  sit | 37% | 57% | 53%
  situp | 87% | 77% | 83%
  smile | 40% | 43% | 40%
  smoke | 40% | 47% | 53%
  somersault | 60% | 83% | 70%
  stand | 20% | 23% | 20%
  swing baseball | 13% | 17% | 17%
  sword exercise | 13% | 30% | 17%
  sword | 31% | 17% | 21%
  talk | 57% | 67% | 60%
  throw | 10% | 33% | 7%
  turn | 37% | 57% | 47%
  walk | 40% | 43% | 50%
  wave | 7% | 20% | 20%
  Average | 48% | 58% | 55%

Benefits of the Three-Stream Model: In Tables 2 and 3, we also evaluate our three-stream network against the original two-stream framework [23]. It can be seen that the performance of DF + RGB is similar to that of a stack (S) of optical flow + RGB (the standard two-stream model) on both UCF101 and HMDB51. However, interestingly, we find that, after combining stacked optical flow with DF + RGB, the accuracy improves by 3% on HMDB51 and 4% on UCF101, which shows that our new representation carries complementary information (such as long-range dynamics) that was previously absent.

Comparisons between CNN Architectures: To further validate and understand the behavior of the three-stream model, we repeated the experiment of the last section using an AlexNet architecture. As shown in Tables 5 and 6, the accuracy of VGG-16 is 2%-7% higher than that of AlexNet, which is expected. We also find that the performance of DF, DF + RGB and DF + RGB + stacked optical flow shows the same trend in both AlexNet and VGG-16 networks.

Table 5. Accuracy comparison between AlexNet and VGG-16 on HMDB51 split 1.
  Method | AlexNet | VGG-16
  Static RGB | 40.50% [23] | 47.12%
  Dynamic Flow | 43.69% | 48.75%
  Dynamic Image | 40.88% | 44.74%
  Dynamic Flow + RGB | 53.75% | 58.30%
  Dynamic Image + RGB | 45.21% | 47.96%

Table 6. Accuracy comparison between AlexNet and VGG-16 on UCF101 split 1.
  Method | AlexNet | VGG-16
  Static RGB | 71.20% [29] | 80.00%
  Dynamic Flow | 75.36% | 78.00%
  Dynamic Flow + RGB | 84.25% | 87.63%
  Dynamic Flow + RGB + (S) Optical Flow | 88.65% | 90.30%

Benefits from Hand-crafted Features: In Tables 7 and 8, we evaluate the performance of the three-stream network and its combination with IDT-FV [27]. On HMDB51, after applying IDT-FV, the accuracy of each method improves by 2% to 10%. While this combination also improves the accuracy of DI + RGB, the improvement is significantly inferior to that of DF + RGB. On the UCF101 dataset, IDT-FV improves performance by 1-2%.

Table 7. Accuracy comparison on HMDB51 split 1 using VGG-16 after combining with IDT-FV.
  Method | +IDT-FV | Original
  Static RGB | 61.89% | 47.06%
  Dynamic Flow | 58.30% | 48.75%
  Dynamic Image | 47.96% | 44.74%
  Dynamic Image + RGB | 65.44% | 47.96%
  Dynamic Flow + RGB | 67.35% | 58.30%
  Dynamic Flow + RGB + (S) Optical Flow | 67.48% | 61.70%
  Dynamic Image + RGB + (S) Optical Flow | 64.13% | 54.40%

Table 8. Accuracy comparison on UCF101 split 1 using VGG-16 after combining with IDT-FV.
  Method | +IDT-FV | Original
  Dynamic Flow + RGB | 89.20% | 87.63%
  Dynamic Flow + RGB + (S) Optical Flow | 91.10% | 90.30%

Comparisons to the State of the Art: In Table 9, we compare our method against state-of-the-art results on the HMDB51 and UCF101 datasets. For this comparison, we use the VGG-16 model trained on dynamic flow images, combined with static RGB, a stack of 10 optical flow frames, and IDT-FV features. The results are averaged over three splits, as is the standard protocol. On both datasets, we find that our three-stream model with dynamic flow images outperforms the best results using dynamic image networks [4]. For example, on HMDB51, replacing the dynamic image with dynamic flow leads to a 2% improvement. We also notice a performance boost against the recent hierarchical variant [8] of dynamic images (67.35% against 66.9%), which recursively summarizes such images from video sub-sequences.

Compared to other state-of-the-art methods, our results are very competitive. Notably, while the accuracy of two-stream fusion [7] is slightly higher than ours, our CNN architecture is significantly simpler. In Figures 4 and 5, we provide qualitative comparisons between dynamic flow and dynamic images.
Figure 4. Left to right: qualitative results of RGB frames, dynamic images, and dynamic flow on the UCF101 dataset.

Figure 5. Left to right: qualitative results of RGB frames, dynamic images, and dynamic flow on the HMDB51 dataset.

Table 9. Classification accuracy against the state of the art on the HMDB51 and UCF101 datasets, averaged over three splits.
  Method | HMDB51 | UCF101
  Two-stream [23] | 59.40% | 88.00%
  Two-stream SR-CNNs [30] | – | 92.60%
  Very Deep Two-stream Fusion [7] | 69.20% | 93.50%
  LSTM-MSD [17] | 63.57% | 90.80%
  IDT-FV [28] | 57.20% | 86.00%
  IDT-HFV [20] | 61.10% | 87.90%
  TDD + IDT-FV [29] | 65.90% | 91.50%
  C3D + IDT-FV [26] | – | 90.40%
  Dynamic Image + RGB + IDT-FV [4] | 65.20% | 89.10%
  Hierarchical Rank Pooling [9] | 66.90% | 91.40%
  Ours: Dyn. Flow + RGB + IDT-FV | 67.19% | 89.41%
  Ours: Dyn. Flow + RGB + (S) Op. Flow + IDT-FV | 67.35% | 91.32%

5. Conclusions

In this paper, we presented a novel video representation, dynamic flow, that summarizes a set of consecutive optical flow frames in a video sequence as a single two-channel image. We showed that our representation can compactly capture the long-range dynamics of actions. Considering this representation as an additional CNN input cue, we proposed a novel three-stream CNN architecture that incorporates single RGB frames (for action context), a stack of flow images (for local action dynamics), and our novel dynamic flow stream (for long-range action evolution). Experiments on standard benchmark datasets (HMDB51 and UCF101) clearly demonstrate that our method is promising in comparison to the state of the art. More importantly, our experimental results reveal that our representation captures the action dynamics more robustly than the recent dynamic image algorithm and provides complementary (long-range) information compared to the traditional stack of flow frames.

Acknowledgements: AC is supported by the Australian Research Council (ARC) through the Centre of Excellence for Robotic Vision (CE140100016).

References

[1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):16, 2011.
[2] B. B. Amor, J. Su, and A. Srivastava. Action recognition using rate-invariant analysis of skeletal shape trajectories. PAMI, 38(1):1–13, 2016.
[3] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39, 2011.
[4] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
[5] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[6] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
[7] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[8] B. Fernando, P. Anderson, M. Hutter, and S. Gould. Discriminative hierarchical rank pooling for activity recognition. In CVPR, 2016.
[9] B. Fernando, P. Anderson, M. Hutter, and S. Gould. Discriminative hierarchical rank pooling for activity recognition. In CVPR, 2016.
[10] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Intl. Conf. on Multimedia, 2014.
[12] T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD, 2002.
[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[14] P. Koniusz, A. Cherian, and F. Porikli. Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV, 2016.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[16] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
[17] Q. Li, Z. Qiu, T. Yao, T. Mei, Y. Rui, and J. Luo. Action recognition by learning deep multi-granular spatio-temporal video representation. In ICMR, 2016.
[18] A. A. Liu, Y. T. Su, W. Z. Nie, and M. Kankanhalli. Hierarchical clustering multi-task learning for joint human action grouping and recognition. PAMI, 39(1):102–114, 2017.
[19] L. Liu, L. Shao, X. Li, and K. Lu. Learning spatio-temporal representations for action recognition: a genetic programming approach. IEEE Transactions on Cybernetics, 46(1):158–170, 2016.
[20] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Computer Vision and Image Understanding, 2016.
[21] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
[22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
[23] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[25] K. Soomro, A. Roshan Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
[26] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[27] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79, 2013.
[28] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[29] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[30] Y. Wang, J. Song, L. Wang, L. Van Gool, and O. Hilliges. Two-stream SR-CNNs for action recognition in videos. In BMVC, 2016.
[31] J. Wu, Y. Zhang, and W. Lin. Good practices for learning to recognize actions using FV and VLAD. IEEE Transactions on Cybernetics, 2015.
[32] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[33] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, 2007.
