ebook img

Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition PDF

1.8 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition

Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition AnoopCherian1,3 PiotrKoniusz2,3 StephenGould1,3 1 AustralianCenterforRoboticVision 2Data61/CSIRO, 3 TheAustralianNationalUniversity,Canberra,Australia [email protected] [email protected] [email protected] 7 1 0 2 Abstract n a Most successful deep learning algorithms for action J recognition extend models designed for image-based tasks 9 such as object recognition to video. Such extensions are 1 typicallytrainedforactionsonsinglevideoframesorvery ] shortclips,andthentheirpredictionsfromsliding-windows V overthevideosequencearepooledforrecognizingtheac- C tion at the sequence level. Usually this pooling step uses . the first-order statistics of frame-level action predictions. s Figure 1: Fine-grained action instances from two different c In this paper, we explore the advantages of using higher- action categories: cut-in (left) and slicing (right). These [ ordercorrelations;specifically,weintroduceHigher-order instancesarefromtheMPIIcookingactivitiesdataset[28]. 1 Kernel(HOK)descriptorsgeneratedfromthelatefusionof v CNNclassifierscoresfromalltheframesinasequence. To 2 generate these descriptors, we use the idea of kernel lin- 3 of action recognition, namely fine-grained action recogni- earization. Specifically, a similarity kernel matrix, which 4 tion. Thisproblemcategoryischaracterizedbyactionsthat captures the temporal evolution of deep classifier scores, 5 have very weak intra-class appearance similarity (such as 0 is first linearized into kernel feature maps. The HOK cutting tomatoes versus cutting cucumbers), while strong . descriptors are then generated from the higher-order co- 1 inter-classsimilarity(peelingcucumbersversusslicingcu- 0 occurrences of these feature maps, and are then used as cumbers). Figure1illustratestwofine-grainedactions. 7 inputtoavideo-levelclassifier. Weprovideexperimentson 1 twofine-grainedactionrecognitiondatasets,andshowthat Unsurprisingly, the most recent trend for fine-grained v: ourschemeleadstostate-of-the-artresults. activity recognition is based on convolutional neural net- i works (CNN) [6, 14]. These schemes extend frameworks X developed for general purpose action recognition [31] into r a 1.Introduction thefine-grainedsettingbyincorporatingheuristicsthatex- tract auxiliary discriminative cues, such as the position of With the resurgence of efficient deep learning algo- body parts or indicators of human-to-object interaction. A rithms, recent years have seen significant advancements in technicaldifficultywhenextendingCNN-basedmethodsto severalfundamentalproblemsincomputervision,including videosisthatunlikeobjectsinimages,theactionsinvideos human action recognition in videos [31, 32, 15, 7]. How- arespreadacrossseveralframes. Thustocorrectlyinferac- ever, despitethisprogress, solutionsforactionrecognition tionsaCNNmustbetrainedontheentirevideosequence. are far from being practically useful, especially in a real- However, the current computational architectures are pro- world setting. This is because real-world actions are often hibitiveinusingmorethanafewtensofframes, thuslim- subtle,mayusehard-to-detecttools(suchasknives,peelers, iting the size of the temporal receptive fields. For exam- etc.), may involve strong appearance or human pose vari- ple,thetwostreammodel[31]usessingleframesoratiny ations, may be of different durations, or done at different sets of optical flow images for learning the actions. One speeds, among several other complicating factors. In this mayovercomethisdifficultybyswitchingtorecurrentnet- paper, we study a difficult subset of the general problem works[1]whichtypicallyrequirelargetrainingsetsforef- fectivelearning. shift-invariant kernels (Section. 3). Using this technique, Asisclear,usingsingleframestotrainCNNsmightbe wepropose Higher-orderKernel descriptors(Section. 4.3) insufficient to capture the dynamics of actions effectively, that capture the higher-order co-occurrences of the kernel while a large stack of frames requires a larger number of maps against the pivots. We apply a non-linear operator CNN parameters that result in model overfitting, thereby (suchaseigenvaluepowernormalization[19])totheHOK demandinglargertrainingsetsandcomputationalresources. descriptors as it is known to lead to superior classification ThisproblemalsoexistsinotherpopularCNNarchitectures performance. The HOK descriptor is then vectorized and suchas3DCNNs[32,14]. Thus, state-of-the-artdeepac- used for training actions. Note that the HOK descriptors, tionrecognitionmodelsareusuallytrainedtogenerateuse- thatweproposehere,belongtothemoregeneralfamilyof ful features from short video clips that are then pooled to third-ordersuper-symmetrictensor(TOSST)descriptorsin- generateholisticsequenceleveldescriptors,whicharethen troducedin[17,18]. usedtotrainalinearclassifieronactionlabels. Forexam- Weprovideexperiments(Section5)usingtheproposed ple,inthetwo-streammodel [31],thesoft-maxscoresfrom scheme on two standard action datasets, namely (i) the thefinalCNNlayersfromtheRGBandopticalflowstreams MPII Cooking activities dataset [28], and (ii) the JHMDB are combined using average pooling. Note that average dataset [13]. Our results demonstrate that higher-order pooling captures only the first-order correlations between poolingisusefulandcanperformcompetitivelytothestate- thescores;ahigher-orderpooling[19]thatcaptureshigher- of-the-artmethods. order correlations between the CNN features can be more 2.RelatedWork appropriate, which is the main motivation for the scheme proposedinthispaper. Initial approaches to tackle fine-grained action recogni- Specifically, we assume a two-stream CNN framework tion have been direct extensions of schemes developed for as suggested in [31] with separate RGB image and optical thegeneralclassificationproblems,mainlybasedonhand- flow streams. Each of these streams is trained on single crafted features. A few notable such approaches include RGBoropticalflowframesfromthesequencesagainstthe [26,28,29,35],inwhichfeaturessuchasHOG,SIFT,HOF, actionlabels. Asnotedabove,whileCNNclassifierpredic- etc., are first extracted at spatio-temporal interest point lo- tions at the frame-level might be very noisy, we posit that cations (e.g., following dense trajectories) and then fused the correlations between the temporal evolution of classi- andusedtotrainaclassifier. However,therecenttrendhas fier scores can capture useful action cues which may help movedtowardsdatadrivenfeaturelearningviadeeplearn- improve recognition performance. Intuitively, some of the ingplatforms[21,31,14,32,7,39]. Asalludedtoearlier, actions might have unlabelled pre-cursors (such as Walk- thelackofsufficientannotatedvideodata,andtheneedfor ing, Picking, etc.). Using higher-order action occurrences expensivecomputationalinfrastructure,makesdirectexten- inasequencemaybeabletocapturesuchpre-cursors,lead- sion of these frameworks (which are primarily developed ingtobetterdiscriminationoftheaction,whiletheymaybe forimagerecognitiontasks)challengingforvideodata,thus ignored as noise when using a first-order pooling scheme. demandingefficientrepresentations. In this paper, we use this intuition to develop a theoretical Another promising setup for fine-grained action recog- framework for action recognition using higher-order pool- nition has been to use mid-level features such as human ingonthetwo-streamclassifierscores. pose. Clearly, estimating the human pose and develop- Ourpoolingschemeisbasedonkernellinearization—a ing action recognition systems on it disentangles the ac- simple technique that decomposes a similarity kernel ma- tion inference from operating directly on pixel-level, thus trixcomputedontheinputdataintermsofasetofanchor allowing for higher-level of sophisticated action reason- points(pivots).Usingakernel(specifically,aGaussianker- ing [28, 34, 41, 6]. Although, there have been significant nel) offers richer representational power, e.g., by embed- advancementsintheposeestimationrecently[37,11],most dingthedatainaninfinitedimensionalHilbertspace. The of these models are computationally demanding and thus use of the kernel also allows us to easily incorporate addi- difficulttoscaletomillionsofvideoframesthatformstan- tionalpriorknowledgeabouttheactions. Forexample,we darddatasets. showinSection4howtoimposeasoft-temporalgrammar Adifferentapproachtofine-grainedrecognitionistode- ontheactionsinthesequence. Kernelsareaclassofposi- tect and analyze human-object interactions in the videos. tivedefiniteobjectsthatbelongtoanon-linearbutRieman- Such a technique is proposed in Zhou et al. [40]. Their niangeometry,andthusdirectlyevaluatingthemforuseina method starts by generating region proposals for human- dualSVM(c.f. primalSVM)iscomputationallyexpensive. objectinteractionsinthescene,extractsvisualfeaturesfrom Therefore, these kernels are usually linearized into feature these regions and trains a classifier for action classes on mapssothatfastlinearclassifierscanbeappliedinstead. In these features. A scheme based on tracking human hands this paper, we use a linearization technique developed for and their interactions with objects is presented in Ni et al.[25]. Houghforestsforactionrecognitionareproposed by Πr a . We use the notation A = S ×n P for j=1 ij k=1 k inGalletal.[9]. Althoughrecognizingobjectsmaybeuse- matrices P and a tensor S to denote that A = k (i1,i2,...,in) ful,theymaynotbeeasilydetectableinthecontextoffine- (cid:80) (cid:80) ...(cid:80) S P(i1,j1)P(i2,j2)...P(in,jn). grainedactions. Thji1snotja2tionarjinses(iin1,it2h,e...T,inu)cke1rdecom2positionnofhigher- We also note that there have been several other deep order tensors, see [23, 16] for details. We note that the learningmodelsdevisedforactionmodelingsuchasusing inner-productbetweentwosuchtensorsfollowsthegeneral 3Dconvolutionalfilters[14],recurrentneuralnetworks[1], element-wise product and summation, as is typically used long-shorttermmemorynetworks[7],andlargescalevideo inlinearalgebra. Weassumeinthesequelthattheorderr classification architectures [15]. These models demand is ordinal and greater than zero. We use the notation (cid:104)·,·(cid:105) hugecollectionsofvideosforeffectivetraining, whichare todenotethestandardEuclideaninnerproductand∆d for usually unavailable for fine-grained activity tasks and thus thed-dimensionalsimplex. theapplicabilityofthesemodelsisyettobeascertained. Poolinghasbeenausefultechniqueforreducingthesize 3.2.KernelLinearization ofvideorepresentations,therebyenablingtheapplicability LetX ={x ∈Rn :t=1,2,...,T}beasetofdatain- t ofefficientmachinelearningalgorithmstothisdatamodal- stancesproducedbysomedynamicprocessatdiscretetime ity. Recently,apoolingschemepreservingthetemporalor- instancest. LetK beakernelmatrixcreatedfromX, the deroftheframesisproposedbyFernandoetal.[8]bysolv- ij-thelementofwhichis: ingaRank-SVMformulation.InWangetal.[36],deepfea- tures are fused along action trajectories in the video. Cor- K =(cid:104)φ(x ),φ(x )(cid:105), (1) ij i j relationsbetweenspace-timefeaturesareproposedin[30]. whereφ(·)representssomekernelizedsimilaritymap. EarlyandlatefusionofCNNfeaturemapsforactionrecog- nitionareproposedin[15,39]. Ourproposedhigher-order Theorem1(LinearExpansionoftheGaussianKernel). For pooling scheme is somewhat similar to the second- and allxandx(cid:48)inRnandσ >0, higher-order pooling approaches proposed in [4] and [19] thatgeneraterepresentationsfromlow-leveldescriptorsfor ψ(cid:18)x−x(cid:48)(cid:19)=e(cid:16)−2σ12(cid:107)x−x(cid:48)(cid:107)22(cid:17) (2) thetaskofsemanticsegmentationofimagesandobjectcat- σ escgroirpytorrecisoginnsitpioirne,drebsypethcteivseelyq.ueMncoerecoovmepr,atoiubriliHtyOkKerdnee-l =(cid:18) 2 (cid:19)n2 (cid:90) e(−σ12(cid:107)x−z(cid:107)22)e(cid:16)−σ12(cid:107)x(cid:48)−z(cid:107)22(cid:17)dz πσ2 z∈Rn (SCK) descriptor introduced in [18] which pools higher- (3) orderoccurrencesoffeaturemapsfromskeletalbodyjoints for action recognition. In contrast, we use the frame-level Proof. See[12,24]. predictionvectors(outputoffc8layers)fromthedeepclas- sifierstogenerateourpooleddescriptors,therefore,thesize We can approximate the linearized kernel ofourpooleddescriptorsisafunctionofthenumberofac- by choosing a finite set of pivots z ∈ Z = tion classes. Moreover, unlike SCK, that uses pose skele- {z1,z2,...,zK;σ1,σ2,...,σK}1. Using Z, we can tons, we use raw action videos directly. Our work is also rewrite(3)as: different from works such as [17, 33] in which tensor de- ψ(x−x(cid:48))≈(cid:104)φ˜(x;Z), φ˜(x(cid:48);Z)(cid:105) (4) scriptorsareproposedonhand-craftedfeatures. Inthispa- per, we show how CNNs could benefit from higher-order where pooling for the application of fine-grained action recogni- (cid:18) (cid:19) tion. φ˜(x;Z)= e−σ112(cid:107)x−z1(cid:107)22,...,e−σ1K2 (cid:107)x−zK(cid:107)22 . (5) 3.Background Wecallφ˜(x;Z)astheapproximatefeaturemapforthein- Inthissection,wefirstreviewthetensornotationthatwe putdatapointx. useinthefollowingsections. Thisprecedesanintroduction In the sequel, we linearize the constituent kernels as tothenecessarytheorybehindkernellinearizationonwhich this allows us to obtain linear maps, which results in ourdescriptorframeworkisbased. favourablecomputationalcomplexities,i.e.,weavoidcom- puting a kernel explicitly between tens of thousands of 3.1.Notation video sequences [18]. Also, see [2, 38] for connections of Let a ∈ Rd be a d-dimensional vector. Then, we our method with the Nystro¨m approximation and random use A = a⊗r to denote the r-mode super-symmetric Fourierfeatures[27]. tensor generated by the r-th order outer-product of a, 1To simplify the notation, we assume that the pivot set includes the where the element at the (i1,i2,...,ir)-th index is given bandwidthsσassociatedwitheachpivotaswell. 4.ProposedMethod whereψ andψ aretwoRBFkernels. Thekernelfunction 1 2 ψ capturesthesimilaritybetweenthetwoclassifierscores 1 Inthissection,wefirstdescribeouroverallCNNframe- at timesteps t and u, while the kernel ψ puts a smooth- 2 work based on the popular two-stream deep model for ac- ing on the length of the interval [t,u]. A small bandwidth tion recognition [31]. This precedes exposition of our σ will demand the two classifier scores be strongly cor- T higher-orderpoolingframework,inwhichweintroduceap- related at the respective time instances, while a larger σ T propriate kernels for the task in hand. We also describe a allows some variance (and hence more robustness) in cap- setupforlearningtheparametersofthekernelmaps. turingthesecorrelations. Inthefollowing,welookatlin- earizing the kernel in (7) for generating higher-order Ker- 4.1.ProblemFormulation nel(HOK)descriptors. Theparameterr capturestheorder LetS = {S ,S ,...,S }beasetofvideosequences, statistics of the kernel, as will be clear in the next section; 1 2 N each sequence belonging to one of M action classes with ζ and ζ are weights associated with the kernels, and we 1 2 labels from the set L = {(cid:96) ,(cid:96) ,...,(cid:96) }. Let S = assumeζ ,ζ >0,ζ +ζ =1. Λisthenormalizationcon- 1 2 M 1 2 1 2 (cid:104)f ,f ,...,f (cid:105) be a sequence of frames of S ∈ S, and stantassociatedwiththekernellinearization(see(3)). Note 1 2 n (cid:83) let F = {f |f ∈S}. In the action recognition thatweassumeallthesequencesareofthesamelengthnin S∈S i i setup, our goal is to find a mapping from any given se- the kernel formulation. This is a mere technicality to sim- quence to its ground truth label. Assume we have trained plifyourderivations.Aswillbeseeninthenextsection,our frame-level action classifiers for each class and that these HOKdescriptordependsonlyonthelengthofonesequence classifierscannotseealltheframesinasequencetogether. (seeDefinition1below). Suppose P : F → [0,1] is one such classifier trained to m 4.3.Higher-orderKernelDescriptors produceaconfidencescoreforaninputframetobelongto the m-th class. Since a single frame is unlikely to repre- Thefollowingeasilyverifiableresult[19]willbehandy sentwelltheentiresequence,theclassifierPmisinaccurate inunderstandingourderivations. at determining the action at the sequence level. However, Proposition 1. Suppose a,b ∈ Rd are two arbitrary vec- ourhypothesisisthatacombinationofthepredictionsfrom tors,thenforanordinalr >0 alltheclassifiersacrossalltheframesinasequencecould capturediscriminativepropertiesoftheactionandcouldim- (cid:0)aTb(cid:1)r =(cid:10)a⊗r,b⊗r(cid:11). (8) proverecognition. Inthesequel,weexplorethispossibility For simplifying the notations, we assume xi = P(fi), inthecontextofhigher-ordertensordescriptors. t t the score vector for frame f in S . Further, suppose we t i 4.2.CorrelationsbetweenClassifierPredictions haveasetofpivotsZ andZ fortheclassifierscoresand F T thetimesteps, respectively. Then, applyingthekernellin- Using the notations defined above, let earization in (5) to (7) using these pivots, we can rewrite (cid:104)f ,f ,...,f (cid:105),f ∈ S denote a sequence corresponding 1 2 n i eachkernelas: to each frame and let P (f ) denote the probability that baeclolansgsitfioecrltarsasin(cid:96)emd.fTohretnh,me mi-th action class predicts fi to ψ1(cid:18)xitσ−Fxju(cid:19)≈zF(cid:88)∈ZF φF(xit;zF)φF(xju;zF) (9) Pˆ(f )=[P (f ),P (f ),...,P (f )]T , (6) =ΦF(xit;ZF)TΦF(xju;ZF) i 1 i 2 i M i (cid:18) (cid:19) t−u (cid:88) ψ ≈ φ (t;z )φ (u;z ) (10) denotes the class confidence vector for frame i. As de- 2 σ T T T T T scribed earlier, we assume that there exists a strong corre- zT∈ZT =Φ (t;Z )TΦ (u;Z ). lationbetweentheconfidencesoftheclassifiersacrosstime T T T T (temporal evolution of the classifier scores) for the frames Substituting(11)into(7),wehave: from similar sequences; i.e., frames that are confused be- tween different classifiers should be confused in a similar n way for different sequences. To capture these correlations K(S ,S )≈ 1 (cid:88) (cid:2)ζ Φ (xi;Z )TΦ (xj;Z )+ betweentheclassifierpredictions,weproposetouseaker- i j Λ 1 F t F F u F t,u=1 nelformulationonthescoresfromsequences,theij-then- ζ Φ (t;Z )TΦ (u;Z )(cid:3)r (11) tryofthiskernelisasfollows: 2 T T T T K(S ,S )=1 (cid:88)n (cid:34)ζ ψ(cid:32)Pˆ(ft)−Pˆ(fu)(cid:33)+ζ ψ (cid:18)t−u(cid:19)(cid:35)r, =1 (cid:88)n (cid:42)(cid:20)√√ζ1ΦF(xit;ZF)(cid:21)⊗r,(cid:20)√√ζ1ΦF(xju;ZF)(cid:21)⊗r(cid:43), i j Λt,u=1 1 1 σF 2 2 σT Λt,u=1 ζ2ΦT(t;ZT) ζ2ΦT(u;ZT) (7) (12) Figure 2: Our overall Higher-order Kernel (HOK) descriptor generation and classification pipeline. PN stands for eigen power-normalization. whereweappliedProposition1to(11).Aseachcomponent 4.4.PowerNormalization intheinnerproductin(12)isindependentintherespective It is often found that using a non-linear operator on temporal indexes, we can carry the summation inside the higher-order tensors leads to significantly better perfor- termsleadingto: mance [20]. For example, for BOW, unit-normalization is (cid:42) 1 (cid:88)n (cid:20) √ζ Φ (xi;Z ) (cid:21)⊗r known to avoid the impact of background features, while (12)→ √ √1 F t F , taking the feature square-root reduces burstiness. Mo- Λ ζ2ΦT(t;ZT) t=1 tivated by these observations, we may incorporate such 1 (cid:88)n (cid:20) √ζ Φ (xj;Z ) (cid:21)⊗r(cid:43) non-linearities to the HOK descriptors as well. As these √ √1 F u F . (13) Λ ζ2ΦT(u;ZT) are higher-order tensors, we apply the following scheme u=1 based on the higher-order SVD decomposition of the ten- Using these derivations, now we are ready to formally sor[19,17,18]. LetH denotetheHOK(X),then define our Higher-order Kernel (HOK) descriptor as fol- lows: [Σ,U1,U2,...,Ur]=HOSVD(H) (15) Definition1(HOK). LetX =(cid:104)x1,x2,...,xn(cid:105),xi ∈[0,1]d Hˆ =sign(Σ)|Σ|α×ri=1Ui (16) aretheprobabilityscoresfromdclassifiersforthenframes where the U’s are the orthonormal matrices (which are all in a sequence. Then, we define the r-th order HOK- thesameinourcase)associatedwiththeTuckerdecomposi- descriptorforX as: tion[23]andΣisthecoretensor.Notethat,unliketheusual √ 1 (cid:88)n (cid:20) ζ Φ (x ;Z ) (cid:21)⊗r SVD operation for matrices, the core tensor in HOSVD is HOK(X)= √ √1 F u F (14) Λ ζ2ΦT(u;ZT) generallynotdiagonal. RefertothenotationsinSection3 u=1 for the definition of ×n . We use Hˆ in training the lin- i=1 for pivot sets Z ⊂ ∆d and Z ⊂ R for the classifier ear classifiers after vetorization. The power-normalization F T scores and the temporal instances respectively. Further, parameterαisselectedviacross-validation. ζ ,ζ ∈ [0,1] such that ζ +ζ = 1, and Λ is a suitable 1 2 1 2 4.5.ComputationalComplexity normalization. For M classes, n frames per sequence, |Z| pivots, and Once the HOK tensor is generated for a sequence, we tensor-orderr,generatingHOKtakesO(n(M+|Z|r)). As vectorize it to be used in a linear classifier for training HOK is super-symmetric, using truncated SVD for a rank againstactionlabels. Ascanbeverified(i.e.,see[19]),the k <<|Z|,HOSVDtakesonlyO(k2|Z|)time. See[18]for HOK tensor will be super-symmetric, and thus removing moredetails. thesymmetricentries,thedimensionalityofthisdescriptor (cid:18) (cid:19) |Z |+|Z |+r−1 is F T . Inthesequel,weuser = 3 4.6.LearningHOKParameters r asatrade-offbetweenperformanceandthedescriptorsize. AnimportantstepinusingtheHOKdescriptoristofind Figure2illustratesouroverallHOKgenerationandclassi- appropriatepivotsetsZ andZ . Giventhatthetemporal F T ficationframework. pivots are uni-dimensional, we select them to be equally- spaced along the time axis after normalizing the tempo- dataset. For the former, we use the evaluation code pub- ral indexes to [0,1]. For Z , which operate on the clas- lishedwiththedataset. F sifier scores, that can be high-dimensional, we propose to 5.3.Preprocessing use an expectation-maximization algorithm. This choice is motivated by the fact that the entries for ζ1ΦF(xu;ZF) As the original MPII cooking videos are of very high in(12)areessentiallycomputingasoft-similaritybetween resolution, whiletheactivitieshappenonlyatcertainparts theclassifierscorevectorsforeveryframeagainstthepiv- of the scene, we used a frame difference scheme to esti- otsthroughaGaussiankernel. Thusmodelingtheproblem mateawindowofthescenetolocalizetheaction.Precisely, inasoft-assignmentsetupusingaGaussianmixturemodel foreverysequence,wefirstconverttheframestohalftheir is natural, the parameters (the mean and the variance) are sizes,followedbyframe-differencing,dilation,smoothing, learnedusingtheEMalgorithm;theseparametersareused andconnectedcomponentanalysis. Thisresultsinabinary asthepivotset.Otherparametersinthemodel,suchasζare imageforeveryframe;whicharethencombinedacrossthe computedusingcross-validation. Thenormalizationfactor sequence and a binary mask is generated for the entire se- Λischosentoben2wherenissequencelength. quence. Weusethelargestboundingboxcontainingallthe connectedcomponentsinthisbinarymaskastheregionof 5.ExperimentsandResults the action, and crops the video to this box. Such cropped framesarethenresizedto224×224sizeandusedtotrain Thissectionprovidesexperimentalevidenceoftheuse- the VGG networks. To compute optical flow, we used the fulnessofourproposedpoolingschemeforfine-grainedac- Brox implementation [3]. Each flow image is rescaled to tionrecognition. Weverifythisontwopopularbenchmark 0–255andsavedasaJPEGimageforstorageefficiencyas datasetsforthistask,namely,(i)theMPIIcookingactivities described in [31]. For the JHMDB dataset, the frames are dataset[28],and(ii)theJHMDBdataset[13]. Notethatwe alreadyinlowresolution. Thus,wedirectlyusetheminthe useaVGG-16model[5]forthetwostreamarchitecturefor CNNafterresizingtotheexpectedinputsizes. both datasets, which is pre-trained on ImageNet for object recognitionandfine-tunedontherespectiveactiondatasets. 5.4.CNNTraining 5.1.Datasets ThetwostreamsoftheCNNaretrainedseparatelyonthe respectiveinputmodalitiesagainstasoftmaxcross-entropy MPIICookingAcitiviesDataset[28]: consistsofhigh- loss. We used the sequences from the training set of the resolution videos of cooking activities recorded by a sta- MPIIcookingactivitiesdatasetfortrainingtheCNNs(1992 tionary camera. The dataset consists of videos of people sequences)andusedthosefromtheprovidedvalidationset cookingvariousdishes;eachvideocontainsasingleperson (615 sequences) to check for overfitting. For the JHMDB, cookingadish, andoverallthereare12suchvideosinthe we used 50% of the training set for fine-tuning the CNNs dataset. Thereare64distinctactivitiesspreadacross3748 ofwhich10%isusedasthevalidationset. Weaugmented video clips and one background activity (1861 clips). The the datasets using random crops, flips and slight rotations activities range from coarse subject motions such as mov- of the frames. While fine-tuning the CNNs (from a pre- ing from X to Y, opening refrigerator, etc., to fine-grained trained imagenet model), we used a fixed learning rate of actionssuchaspeel,slice,cutapart,etc. 10−4 andaninputbatchsizeof50frames. TheCNNtrain- ing was stopped as soon as the loss on the validation set JHMDB Dataset [13]: is a subset of the larger HMDB started increasing, which happened in about 6K iterations dataset[22],butcontainsvideosinwhichthehumanlimbs for the appearance stream and 40K iterations for the flow can be clearly visible. The dataset contains 21 action cat- stream. egories such as brush hair, pick, pour, push, etc. Unlike 5.5.HOKParameterLearning the MPII cooking activities dataset, the videos in JHMDB dataset are low resolution. Each video clip is a few sec- AsisclearfromDef.1,thereareafewhyper-parameters ondslong. Thereareatotalof968videoswhicharemostly associatedwiththeHOKdescriptor. Inthissection,wesys- downloadedfromtheinternet. tematicallyanalyzetheinfluenceoftheseparameterstothe overallclassificationperformanceofthedescriptor. Tothis 5.2.EvaluationProtocols end, we use the JHMDB dataset split1. Specifically, we Wefollowthestandardprotocolssuggestedintheorig- explore the effect of changes to (i) the number of classi- inal publications that introduced these datasets. Thus, we fierpivotsZ , (ii)thenumberoftemporalpivotsZ , and F T use the mean average precision (mAP) over 7-fold cross- (iii) that of the power-normalization factor α. In Figure 3, validationfortheMPIIdataset,whileweusethemeanav- we plot the classifier accuracy against each of these cases. erageaccuracyover3-foldcross-validationfortheJHMDB Each experiment is repeated 5 times with different initial- (a) (b) (c) Figure3: Analysisoftheinfluenceofvarioushyper-parametersontheactionrecognitionaccuracy. Thenumbersarecom- putedonthesplit1oftheJHMDBdataset,whichconsistsof21groundtruthactionclasses. izations (for the GMM) and the average accuracy is used Action Avg. Pool HOK.Pool fortheplots. mAP(%) mAP(%) For(i),wefixedthenumberoftemporalpivotsto5,with ChangeTemperature 15.1 57.5 values [0,0.25,0.5,0.75,1] and fixed the σ = 0.1. The Dry 27.7 50.2 T classifier pivots Z and and their standard deviations σ Fillwaterfromtap 10.5 40.6 F F are found by learning GMM models with the prescribed Open/closedrawer 25.2 65.1 number of Gaussian components. The mean and the di- Table 1: An analysis of per-class action recognition accu- agonal variance from this learned model are then used as racywhenusingaveragepoolingandHOKpooling(thetop thepivotsetanditsvariancerespectively. Asisclearfrom classescorrectedbyHOKpooling). Figure3(a),asthenumberofclassifierpivotsincreases,the accuracyincreasesaswell.However,beyondacertainnum- ber,theaccuracystartsdropping. Thisisperhapsduetothe second-orderstatisticsandourproposedthird-order.For(i), sequencesnotcontainingsufficientnumberofframestoac- weaveragetheclassifiersoft-maxscoresasisusuallydone count for larger models. Note that the JHMDB sequences inlatepooling[31].For(ii),weusethesecond-orderkernel containabout30-40framespersequence. Wealsonotethat matrix without pivoting. Specifically, for every sequence, the accuracy of Flow+RGB is significantly higher than ei- letxiandxjdenotetheprobabilisticevolutionofprobablis- therstreamalone. : : ticscoresforclassifiersiandjrespectively. Then,wecom- For(ii),wefixthenumberofclassifierpivotsat48(asis thebestwefoundfromFigure3(b)),andvariedthenumber puteakernelmatrixK(xi:,xj:) = e(cid:16)−σ(cid:107)xi:−xj:(cid:107)22(cid:17). Asthis oftemporalpivotsfrom1to30instepsof5. Similartothe matrix is a positive definite object, we use log-Euclidean classifierpivots,wefindthatincreasingthenumberoftem- mapofit(thatis,thematrixlogarithm;whichistheasymp- poralpivotsisbeneficial.Further,alargerσ leadstoadrop toticlimitofα→0inpowernormalization)forembedding T inaccuracy,whichimpliesthatorderingoftheprobabilistic itintheEuclideanspace[10]. Thisvectoristhenusedfor scoresdoesplayaroleintherecognitionoftheactivity. training. As isclear, this matrix captures the second-order For (iii), we fixed the number of classifier pivots at 48, statisticsofactions.Andfor(iii),weusetheproposedHOK andthenumberoftemporalpivotsto5(asdescribedfor(i) descriptorasdescribedinDefinition1. AsisclearfromTa- above). Wevariedαfrom0.1to1instepsof0.1. Wefind ble2,higher-orderstatisticsleadstosignificantbenefitson thatαcloserto0ismorepromising,implyingthatthereis boththedatasetsandforboththeinputmodalities(flowand significantinfluenceofburstinessinthesequences. Thatis, RGB)andtheircombinations. reducingmorethelargerprobabilisticco-occurrences(than 5.7.ComparisonstotheStateoftheArt thoseoftheweakco-occurrences)inthetensorleadstobet- terperformance. In Tables 3 and 4, we compare HOK descriptors to the state-of-the-art results on these two datasets. In this case, 5.6.Results we combine the HOK descriptors from both the RGB and In this section, we provide full experimental compar- flowstreamsoftheCNN.FortheMPIIdataset, weuse32 isons for the two datasets. Our main goal is to analyze pivotsfortheclassifierscores, and5equispacedpivotsfor the usefulness of higher-order pooling for action recogni- the time steps, with σ = 0.1. For the second-order ten- T tion. Tothisend,inTable2,weshowtheperformancedif- sor, we use a σ = 0.1 for both datasets. We use the same ferences between using (i) the first-order statistics, (ii) the setupfortheJHMDBdataset,exceptthatweuse48pivots. The power normalization factor is set to 0.1. As is clear, Experiment MPII JHMDB although HOK by itself is not superior to other methods, mAP(%) MeanAvg. Acc(%) when the second- and third-order statistics are combined RGB(avg.pool) 33.9 51.5 (stackingtheirvaluesintoavector),itdemonstratessignif- Flow(avg.pool) 37.6 54.8 icantpromise. Forexample, weseeanimprovementof5– RGB+Flow(avg.pool) 38.1 55.9 6%againsttherecentmethodin[6]thatalsousesaCNN. RGB(second-order) 56.1 52.3 Further, we also find that when the higher-order statistics Flow(second-order) 61.3 60.4 are combined with trajectory features, there is further im- RGB+Flow(second-order) 67.0 63.4 provement in accuracy, which results in a model that out- RGB(HOK) 47.8 52.3 performsthestateoftheart. Flow(HOK) 55.4 58.2 RGB+Flow(HOK) 60.6 64.7 5.8.AnalysisofResults Table 2: Evaluation of the HOK descriptor on the output To gain insights into the performance benefits noted ofeachCNNstreamandtheirfusionontheMPII(7-splits) above, we conducted an analysis of the results on the andJHMDBdatasets(3-splits). Wealsoshowtheaccuracy MPII dataset. Table 1 lists the activities that are initially obtainedviasecond-orderpoolingthatusesthekernelma- confused in average pooling, while corrected by HOK. trixdirectlywithoutlinearization(seetextfordetails). Specifically, we find that activities such as Fill water from tap and Open/close Drawer which are originally confused Algorithm mAP(%) withWashObjectsandTakeoutfromdrawergetscorrected Holistic+Pose,CVPR’12 57.9 usinghigher-orderpooling.Notethattheseactivitiesarein- VideoDarwin,CVPR’15 72.0 herentlyambiguous,unlesscontextandsub-actionsarean- InteractionPartMining,CVPR’15 72.4 alyzed. Thisshowsthatourdescriptorcaneffectivelyrep- P-CNN,ICCV’15 62.3 resentusefulcuesforrecognition. P-CNN+IDT-FV,ICCV’15 71.4 SemanticFeatures,CVPR’15 70.5 InTable2(column1),weseethatthesecond-orderten- HierarchicalMid-LevelActions,ICCV’15 66.8 sor performs significantly better than HOK for the MPII HOK(ours) 60.6 dataset. We suspect this surprising behavior is due to the HOK+Second-order(ours) 69.1 highly unbalanced number of frames in sequences in this HOK+second-order+Trajectories 73.1 dataset. For example, for classes such as pull-out, pour, etc., that have only about 7 clips each of 50–90 frames, Table3: MPIICookingActivitiesdataset(7-splits) the second-order is about 30% better than HOK in mAP, while for classes, such as take spice holder, having more Algorithm Avg. Acc. (%) than 25 videos, with 50–150 frames, HOK is about 10% P-CNN,ICCV’15 61.1 better than second-order. This suggests that the poor per- P-CNN+IDT-FV,ICCV’15 72.2 formanceisperhapsduetotheunreliableestimationofdata ActionTubes,CVPR’15 62.5 statistics and that second- and third-order provide comple- StackedFisherVectors,ECCV’14 69.03 mentarycues,asalsowitnessedinTable3.FortheJHMDB IDT+FV,ICCV’13 62.8 dataset,thereareabout30framesinallsequencesandthus HOK(Ours) 64.7 thestatisticsaremoreconsistent. Anotherreasoncouldbe HOK+second-order(Ours) 66.8 thatunbalancedsequencesmaybiastheGMMparameters, HOK+second-order+IDT-FV 73.3 thatformthepivots,toclassesthathavemoreframes. Table4: JHMBDDataset(3-splits) 6.Conclusion Inthispaper,wepresentedatechniqueforhigher-order Acknowledgements: Thisresearchwassupportedbythe poolingofCNNscoresforthetaskofactionrecognitionin Australian Research Council (ARC) through the Centre of videos. Weshowedhowtousetheideaofkernellineariza- ExcellenceforRoboticVision(CE140100016). tion to generate a higher-order kernel descriptor, which References can capture latent relationships between the CNN classi- fier scores. Our experimental analysis on two standard [1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and fine-grainedactiondatasetsclearlydemonstratesthatusing A. Baskurt. Sequential deep learning for human action higher-orderrelationshipsisbeneficialforthetaskandleads recognition. InHumanBehaviorUnderstanding,pages29– tostate-of-the-artperformance. 39.2011. [2] L.BoandC.Sminchisescu. Efficientmatchkernelbetween [23] L.D.Lathauwer,B.D.Moor,andJ.Vandewalle.Amultilin- setsoffeaturesforvisualrecognition. InNIPS,2009. earsingularvaluedecomposition. SIAMJ.MatrixAnalysis [3] T.BroxandJ.Malik. Largedisplacementopticalflow: de- andApplications,21:1253–1278,2000. scriptor matching in variational motion estimation. PAMI, [24] J.Mairal,P.Koniusz,Z.Harchaoui,andC.Schmid. Convo- 33(3):500–513,2011. lutionalkernelnetworks. NIPS,2014. [4] J.Carreira,R.Caseiro,J.Batista,andC.Sminchisescu. Se- [25] B.Ni,V.R.Paramathayalan,andP.Moulin. Multiplegran- manticsegmentationwithsecond-orderpooling. InECCV. ularityanalysisforfine-grainedactiondetection. InCVPR, 2012. 2014. [5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. [26] L.Pishchulin, M. Andriluka, and B. Schiele. Fine-grained Returnofthedevilinthedetails: Delvingdeepintoconvo- activityrecognitionwithholisticandposebasedfeatures. In lutionalnets. InBMVC,2014. PatternRecognition,pages678–689.Springer,2014. [6] G.Che´ron, I.Laptev, andC.Schmid. P-CNN:Pose-based [27] A. Rahimi and B. Recht. Random features for large-scale CNNFeaturesforActionRecognition. InICCV,2015. kernelmachines. InNIPS,2007. [7] J.Donahue,L.A.Hendricks,S.Guadarrama,M.Rohrbach, [28] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A S.Venugopalan,K.Saenko,andT.Darrell.Long-termrecur- databaseforfinegrainedactivitydetectionofcookingactiv- rent convolutional networks for visual recognition and de- ities. InCVPR,2012. scription. InCVPR,2015. [29] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. An- [8] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and driluka,M.Pinkal,andB.Schiele.Recognizingfine-grained T.Tuytelaars. Modelingvideoevolutionforactionrecogni- and composite activities using hand-centric features and tion. InCVPR,2015. scriptdata. IJCV,119(3):346–373,2016. [9] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempit- [30] E.ShechtmanandM.Irani. Space-timebehaviorbasedcor- sky. Houghforestsforobjectdetection,tracking,andaction relation. InCVPR,2005. recognition. PAMI,33(11):2188–2202,2011. [31] K.SimonyanandA.Zisserman. Two-streamconvolutional [10] K.Guo,P.Ishwar,andJ.Konrad. Actionrecognitionfrom networksforactionrecognitioninvideos. InNIPS,2014. videousingfeaturecovariancematrices. TIP,2013. [32] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and [11] E.Insafutdinov,L.Pishchulin,B.Andres,M.Andriluka,and M. Paluri. Learning spatiotemporal features with 3d con- B.Schiele. Deepercut: Adeeper,stronger,andfastermulti- volutionalnetworks. InICCV,2015. personposeestimationmodel. InECCV,2016. [33] M.A.VasilescuandD.Terzopoulos. Multilinearanalysisof [12] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. JMLR,5:819–844,2004. imageensembles:Tensorfaces. ECCV,2002. [13] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. [34] C.Wang,Y.Wang,andA.L.Yuille. Anapproachtopose- Towardsunderstandingactionrecognition. InICCV,2013. basedactionrecognition. InCVPR,2013. [14] S.Ji,W.Xu,M.Yang,andK.Yu. 3dconvolutionalneural [35] H.Wang, A.Kla¨ser, C.Schmid, andC.-L.Liu. Densetra- networks for human action recognition. PAMI, 35(1):221– jectoriesandmotionboundarydescriptorsforactionrecog- 231,2013. nition. IJCV,103(1):60–79,2013. [15] A.Karpathy,G.Toderici,S.Shetty,T.Leung,R.Sukthankar, [36] L. Wang, Y. Qiao, and X. Tang. Action recognition with andL.Fei-Fei. Large-scalevideoclassificationwithconvo- trajectory-pooleddeep-convolutionaldescriptors. InCVPR, lutionalneuralnetworks. InCVPR,2014. 2015. [16] T. G. Kolda and B. W. Bader. Tensor decompositions and [37] S.-E.Wei,V.Ramakrishna,T.Kanade,andY.Sheikh. Con- applications. SIAMReview,51(3):455–500,2009. volutionalposemachines. InCVPR,2016. [17] P. Koniusz and A. Cherian. Sparse coding for third-order [38] C. Williams and M. Seeger. Using the nystro¨m method to super-symmetric tensor descriptors with application to tex- speedupkernelmachines. InNIPS,2001. turerecognition. InCVPR,2016. [39] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, [18] P. Koniusz, A. Cherian, and F. Porikli. Tensor representa- O.Vinyals,R.Monga,andG.Toderici. Beyondshortsnip- tionsviakernellinearizationforactionrecognitionfrom3D pets:Deepnetworksforvideoclassification.InCVPR,2015. skeletons. ECCV,2016. [40] Y.Zhou,B.Ni,R.Hong,M.Wang,andQ.Tian. Interaction [19] P.Koniusz,F.Yan,P.Gosselin,andK.Mikolajczyk.Higher- part mining: A mid-level approach for fine-grained action orderoccurrencepoolingforbags-of-words: Visualconcept recognition. InCVPR,2015. detection. PAMI,2016. [41] S.ZuffiandM.J.Black. Puppetflow. IJCV,101(3):437– [20] P.Koniusz,F.Yan,andK.Mikolajczyk.Comparisonofmid- 458,2013. levelfeaturecodingapproachesandpoolingstrategiesinvi- sualconceptdetection. CVIU,2012. [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS,2012. [22] H.Kuehne,H.Jhuang,E.Garrote,T.Poggio,andT.Serre. Hmdb:alargevideodatabaseforhumanmotionrecognition. InICCV,2011.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.