Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition

Anoop Cherian¹,³   Piotr Koniusz²,³   Stephen Gould¹,³
¹Australian Center for Robotic Vision   ²Data61/CSIRO   ³The Australian National University, Canberra, Australia
[email protected]   [email protected]   [email protected]

arXiv:1701.05432v1 [cs.CV] 19 Jan 2017

Abstract

Most successful deep learning algorithms for action recognition extend models designed for image-based tasks, such as object recognition, to video. Such extensions are typically trained for actions on single video frames or very short clips, and their predictions from sliding windows over the video sequence are then pooled for recognizing the action at the sequence level. Usually this pooling step uses the first-order statistics of frame-level action predictions. In this paper, we explore the advantages of using higher-order correlations; specifically, we introduce Higher-order Kernel (HOK) descriptors generated from the late fusion of CNN classifier scores from all the frames in a sequence. To generate these descriptors, we use the idea of kernel linearization. Specifically, a similarity kernel matrix, which captures the temporal evolution of deep classifier scores, is first linearized into kernel feature maps. The HOK descriptors are then generated from the higher-order co-occurrences of these feature maps, and are then used as input to a video-level classifier. We provide experiments on two fine-grained action recognition datasets and show that our scheme leads to state-of-the-art results.

Figure 1: Fine-grained action instances from two different action categories: cut-in (left) and slicing (right). These instances are from the MPII cooking activities dataset [28].

1. Introduction

With the resurgence of efficient deep learning algorithms, recent years have seen significant advancements in several fundamental problems in computer vision, including human action recognition in videos [31, 32, 15, 7]. However, despite this progress, solutions for action recognition are far from being practically useful, especially in a real-world setting. This is because real-world actions are often subtle, may use hard-to-detect tools (such as knives, peelers, etc.), may involve strong appearance or human pose variations, may be of different durations, or may be done at different speeds, among several other complicating factors.

In this paper, we study a difficult subset of the general problem of action recognition, namely fine-grained action recognition. This problem category is characterized by actions that have very weak intra-class appearance similarity (such as cutting tomatoes versus cutting cucumbers) but strong inter-class similarity (peeling cucumbers versus slicing cucumbers). Figure 1 illustrates two fine-grained actions.

Unsurprisingly, the most recent trend in fine-grained activity recognition is based on convolutional neural networks (CNNs) [6, 14]. These schemes extend frameworks developed for general-purpose action recognition [31] to the fine-grained setting by incorporating heuristics that extract auxiliary discriminative cues, such as the position of body parts or indicators of human-to-object interaction. A technical difficulty when extending CNN-based methods to videos is that, unlike objects in images, actions in videos are spread across several frames. Thus, to correctly infer actions, a CNN must be trained on the entire video sequence. However, current computational architectures are prohibitive in using more than a few tens of frames, thus limiting the size of the temporal receptive fields. For example, the two-stream model [31] uses single frames or a tiny set of optical flow images for learning the actions. One may overcome this difficulty by switching to recurrent networks [1], which typically require large training sets for effective learning.
As is clear, using single frames to train CNNs might be insufficient to capture the dynamics of actions effectively, while a large stack of frames requires a larger number of CNN parameters that result in model overfitting, thereby demanding larger training sets and computational resources. This problem also exists in other popular CNN architectures such as 3D CNNs [32, 14]. Thus, state-of-the-art deep action recognition models are usually trained to generate useful features from short video clips, which are then pooled to generate holistic sequence-level descriptors, which in turn are used to train a linear classifier on action labels. For example, in the two-stream model [31], the soft-max scores from the final CNN layers of the RGB and optical flow streams are combined using average pooling. Note that average pooling captures only the first-order correlations between the scores; a higher-order pooling [19] that captures higher-order correlations between the CNN features can be more appropriate, which is the main motivation for the scheme proposed in this paper.

Specifically, we assume a two-stream CNN framework as suggested in [31], with separate RGB image and optical flow streams. Each of these streams is trained on single RGB or optical flow frames from the sequences against the action labels. As noted above, while CNN classifier predictions at the frame level might be very noisy, we posit that the correlations between the temporal evolutions of classifier scores can capture useful action cues which may help improve recognition performance. Intuitively, some of the actions might have unlabelled pre-cursors (such as walking, picking, etc.). Using higher-order action co-occurrences in a sequence may be able to capture such pre-cursors, leading to better discrimination of the action, whereas they may be ignored as noise by a first-order pooling scheme. In this paper, we use this intuition to develop a theoretical framework for action recognition using higher-order pooling on the two-stream classifier scores.

Our pooling scheme is based on kernel linearization, a simple technique that decomposes a similarity kernel matrix computed on the input data in terms of a set of anchor points (pivots). Using a kernel (specifically, a Gaussian kernel) offers richer representational power, e.g., by embedding the data in an infinite-dimensional Hilbert space. The use of the kernel also allows us to easily incorporate additional prior knowledge about the actions. For example, we show in Section 4 how to impose a soft temporal grammar on the actions in the sequence. Kernels are a class of positive definite objects that belong to a non-linear but Riemannian geometry, and thus directly evaluating them for use in a dual SVM (cf. primal SVM) is computationally expensive. Therefore, these kernels are usually linearized into feature maps so that fast linear classifiers can be applied instead. In this paper, we use a linearization technique developed for shift-invariant kernels (Section 3). Using this technique, we propose Higher-order Kernel descriptors (Section 4.3) that capture the higher-order co-occurrences of the kernel maps against the pivots. We apply a non-linear operator (such as eigenvalue power normalization [19]) to the HOK descriptors, as it is known to lead to superior classification performance. The HOK descriptor is then vectorized and used for training actions. Note that the HOK descriptors that we propose here belong to the more general family of third-order super-symmetric tensor (TOSST) descriptors introduced in [17, 18].

We provide experiments (Section 5) using the proposed scheme on two standard action datasets, namely (i) the MPII Cooking activities dataset [28] and (ii) the JHMDB dataset [13]. Our results demonstrate that higher-order pooling is useful and can perform competitively with state-of-the-art methods.

2. Related Work

Initial approaches to tackle fine-grained action recognition have been direct extensions of schemes developed for general classification problems, mainly based on hand-crafted features. A few notable such approaches include [26, 28, 29, 35], in which features such as HOG, SIFT, HOF, etc., are first extracted at spatio-temporal interest point locations (e.g., following dense trajectories) and then fused and used to train a classifier. However, the recent trend has moved towards data-driven feature learning via deep learning platforms [21, 31, 14, 32, 7, 39]. As alluded to earlier, the lack of sufficient annotated video data, and the need for expensive computational infrastructure, makes direct extension of these frameworks (which are primarily developed for image recognition tasks) challenging for video data, thus demanding efficient representations.

Another promising setup for fine-grained action recognition has been to use mid-level features such as human pose. Clearly, estimating the human pose and developing action recognition systems on it disentangles action inference from operating directly at the pixel level, thus allowing for a higher level of sophisticated action reasoning [28, 34, 41, 6]. Although there have been significant advancements in pose estimation recently [37, 11], most of these models are computationally demanding and thus difficult to scale to the millions of video frames that form standard datasets.

A different approach to fine-grained recognition is to detect and analyze human-object interactions in the videos. Such a technique is proposed in Zhou et al. [40]. Their method starts by generating region proposals for human-object interactions in the scene, extracts visual features from these regions, and trains a classifier for action classes on these features. A scheme based on tracking human hands and their interactions with objects is presented in Ni et al. [25].
Hough forests for action recognition are proposed in Gall et al. [9]. Although recognizing objects may be useful, they may not be easily detectable in the context of fine-grained actions.

We also note that there have been several other deep learning models devised for action modeling, such as 3D convolutional filters [14], recurrent neural networks [1], long short-term memory networks [7], and large-scale video classification architectures [15]. These models demand huge collections of videos for effective training, which are usually unavailable for fine-grained activity tasks, and thus the applicability of these models is yet to be ascertained.

Pooling has been a useful technique for reducing the size of video representations, thereby enabling the applicability of efficient machine learning algorithms to this data modality. Recently, a pooling scheme preserving the temporal order of the frames was proposed by Fernando et al. [8] by solving a Rank-SVM formulation. In Wang et al. [36], deep features are fused along action trajectories in the video. Correlations between space-time features are proposed in [30]. Early and late fusion of CNN feature maps for action recognition are proposed in [15, 39]. Our proposed higher-order pooling scheme is somewhat similar to the second- and higher-order pooling approaches proposed in [4] and [19], which generate representations from low-level descriptors for the tasks of semantic segmentation of images and object category recognition, respectively. Moreover, our HOK descriptor is inspired by the sequence compatibility kernel (SCK) descriptor introduced in [18], which pools higher-order occurrences of feature maps from skeletal body joints for action recognition. In contrast, we use the frame-level prediction vectors (the output of the fc8 layers) from the deep classifiers to generate our pooled descriptors; therefore, the size of our pooled descriptors is a function of the number of action classes. Moreover, unlike SCK, which uses pose skeletons, we use raw action videos directly. Our work is also different from works such as [17, 33], in which tensor descriptors are proposed on hand-crafted features. In this paper, we show how CNNs can benefit from higher-order pooling for the application of fine-grained action recognition.

3. Background

In this section, we first review the tensor notation that we use in the following sections. This precedes an introduction to the necessary theory behind kernel linearization, on which our descriptor framework is based.

3.1. Notation

Let $a \in \mathbb{R}^d$ be a $d$-dimensional vector. Then, we use $A = a^{\otimes r}$ to denote the $r$-mode super-symmetric tensor generated by the $r$-th order outer-product of $a$, where the element at the $(i_1, i_2, \ldots, i_r)$-th index is given by $A_{(i_1, i_2, \ldots, i_r)} = \prod_{j=1}^{r} a_{i_j}$. We use the notation $A = S \times_{k=1}^{n} P_k$ for matrices $P_k$ and a tensor $S$ to denote that

$$A_{(i_1, i_2, \ldots, i_n)} = \sum_{j_1}\sum_{j_2}\cdots\sum_{j_n} S_{(j_1, j_2, \ldots, j_n)}\, P_1(i_1, j_1)\, P_2(i_2, j_2) \cdots P_n(i_n, j_n).$$

This notation arises in the Tucker decomposition of higher-order tensors; see [23, 16] for details. We note that the inner product between two such tensors follows the usual element-wise product and summation, as is typical in linear algebra. We assume in the sequel that the order $r$ is ordinal and greater than zero. We use the notation $\langle\cdot,\cdot\rangle$ to denote the standard Euclidean inner product and $\Delta_d$ for the $d$-dimensional simplex.
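To make this notation concrete, the following NumPy sketch (our illustration, not code from the paper) builds the super-symmetric outer product $a^{\otimes r}$ for $r = 3$, applies the Tucker-style mode products, and checks the element-wise inner-product rule:

```python
import numpy as np

def outer_power(a, r):
    """Super-symmetric tensor a^{(x)r}: A[i1,...,ir] = a[i1] * ... * a[ir]."""
    A = a
    for _ in range(r - 1):
        A = np.multiply.outer(A, a)
    return A

def mode_products(S, P1, P2, P3):
    """A = S x_1 P1 x_2 P2 x_3 P3 for a third-order tensor S."""
    return np.einsum('jkl,aj,bk,cl->abc', S, P1, P2, P3)

a, b = np.random.rand(4), np.random.rand(4)
# element-wise product-and-sum inner product; for rank-one super-symmetric
# tensors it equals (a.b)^3, the identity used later as Proposition 1
assert np.isclose(np.sum(outer_power(a, 3) * outer_power(b, 3)),
                  a.dot(b) ** 3)
```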
3.2. Kernel Linearization

Let $X = \{x_t \in \mathbb{R}^n : t = 1, 2, \ldots, T\}$ be a set of data instances produced by some dynamic process at discrete time instances $t$. Let $K$ be a kernel matrix created from $X$, the $ij$-th element of which is

$$K_{ij} = \langle \phi(x_i),\, \phi(x_j) \rangle, \qquad (1)$$

where $\phi(\cdot)$ represents some kernelized similarity map.

Theorem 1 (Linear Expansion of the Gaussian Kernel). For all $x$ and $x'$ in $\mathbb{R}^n$ and $\sigma > 0$,

$$\psi\!\left(\frac{x - x'}{\sigma}\right) = e^{-\frac{1}{2\sigma^2}\|x - x'\|_2^2} \qquad (2)$$

$$= \left(\frac{2}{\pi\sigma^2}\right)^{\frac{n}{2}} \int_{z \in \mathbb{R}^n} e^{-\frac{1}{\sigma^2}\|x - z\|_2^2}\, e^{-\frac{1}{\sigma^2}\|x' - z\|_2^2}\, dz. \qquad (3)$$

Proof. See [12, 24].

We can approximate the linearized kernel by choosing a finite set of pivots $z \in Z = \{z_1, z_2, \ldots, z_K; \sigma_1, \sigma_2, \ldots, \sigma_K\}$.¹ Using $Z$, we can rewrite (3) as

$$\psi(x - x') \approx \langle \tilde{\phi}(x; Z),\, \tilde{\phi}(x'; Z) \rangle, \qquad (4)$$

where

$$\tilde{\phi}(x; Z) = \left( e^{-\frac{1}{\sigma_1^2}\|x - z_1\|_2^2}, \ldots, e^{-\frac{1}{\sigma_K^2}\|x - z_K\|_2^2} \right). \qquad (5)$$

We call $\tilde{\phi}(x; Z)$ the approximate feature map for the input data point $x$.

In the sequel, we linearize the constituent kernels, as this allows us to obtain linear maps, which results in favourable computational complexities; i.e., we avoid computing a kernel explicitly between tens of thousands of video sequences [18]. Also, see [2, 38] for connections of our method with the Nyström approximation and random Fourier features [27].

¹To simplify the notation, we assume that the pivot set includes the bandwidths $\sigma$ associated with each pivot as well.
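As a minimal sketch (ours, with an illustrative pivot grid), the feature map of Eq. (5) is just a vector of Gaussian responses to the pivots; the inner product of two such maps tracks the kernel of Eq. (2) up to a constant that depends on the pivot density and is absorbed into the normalization $\Lambda$ used later:

```python
import numpy as np

def approx_feature_map(x, pivots, sigmas):
    """phi~(x; Z) from Eq. (5): one Gaussian response per pivot z_k,
    each with its own bandwidth sigma_k."""
    d = pivots - x                                  # (K, n) differences
    return np.exp(-np.sum(d * d, axis=1) / sigmas ** 2)

rng = np.random.default_rng(0)
x, xp = rng.random(2), rng.random(2)
Z = np.stack(np.meshgrid(np.linspace(-1, 2, 40),
                         np.linspace(-1, 2, 40)), -1).reshape(-1, 2)
s = np.full(len(Z), 0.5)
approx = approx_feature_map(x, Z, s) @ approx_feature_map(xp, Z, s)
exact = np.exp(-np.sum((x - xp) ** 2) / (2 * 0.5 ** 2))
# approx / exact stays roughly constant across pairs (x, x') for a
# sufficiently dense pivot grid, mirroring the integral in Eq. (3)
```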
4. Proposed Method

In this section, we first describe our overall CNN framework, based on the popular two-stream deep model for action recognition [31]. This precedes an exposition of our higher-order pooling framework, in which we introduce appropriate kernels for the task at hand. We also describe a setup for learning the parameters of the kernel maps.

4.1. Problem Formulation

Let $\mathcal{S} = \{S_1, S_2, \ldots, S_N\}$ be a set of video sequences, each sequence belonging to one of $M$ action classes with labels from the set $\mathcal{L} = \{\ell_1, \ell_2, \ldots, \ell_M\}$. Let $S = \langle f_1, f_2, \ldots, f_n \rangle$ be a sequence of frames of $S \in \mathcal{S}$, and let $\mathcal{F} = \bigcup_{S \in \mathcal{S}} \{f_i \mid f_i \in S\}$. In the action recognition setup, our goal is to find a mapping from any given sequence to its ground-truth label. Assume we have trained frame-level action classifiers for each class and that these classifiers cannot see all the frames in a sequence together. Suppose $P_m : \mathcal{F} \rightarrow [0, 1]$ is one such classifier, trained to produce a confidence score for an input frame to belong to the $m$-th class. Since a single frame is unlikely to represent the entire sequence well, the classifier $P_m$ is inaccurate at determining the action at the sequence level. However, our hypothesis is that a combination of the predictions from all the classifiers across all the frames in a sequence could capture discriminative properties of the action and could improve recognition. In the sequel, we explore this possibility in the context of higher-order tensor descriptors.

4.2. Correlations between Classifier Predictions

Using the notations defined above, let $\langle f_1, f_2, \ldots, f_n \rangle$, $f_i \in S$, denote the frames of a sequence $S$, and let $P_m(f_i)$ denote the probability that the classifier trained for the $m$-th action class predicts $f_i$ to belong to class $\ell_m$. Then

$$\hat{P}(f_i) = \left[ P_1(f_i),\, P_2(f_i),\, \ldots,\, P_M(f_i) \right]^T \qquad (6)$$

denotes the class confidence vector for frame $i$. As described earlier, we assume that there exists a strong correlation between the confidences of the classifiers across time (the temporal evolution of the classifier scores) for frames from similar sequences; i.e., frames that are confused between different classifiers should be confused in a similar way for different sequences. To capture these correlations between the classifier predictions, we propose to use a kernel formulation on the scores from sequences, the $ij$-th entry of which is as follows:

$$K(S_i, S_j) = \frac{1}{\Lambda} \sum_{t,u=1}^{n} \left[ \zeta_1\, \psi_1\!\left( \frac{\hat{P}(f_t^i) - \hat{P}(f_u^j)}{\sigma_F} \right) + \zeta_2\, \psi_2\!\left( \frac{t - u}{\sigma_T} \right) \right]^r, \qquad (7)$$

where $\psi_1$ and $\psi_2$ are two RBF kernels. The kernel function $\psi_1$ captures the similarity between the two classifier scores at timesteps $t$ and $u$, while the kernel $\psi_2$ puts a smoothing on the length of the interval $[t, u]$. A small bandwidth $\sigma_T$ will demand that the two classifier scores be strongly correlated at the respective time instances, while a larger $\sigma_T$ allows some variance (and hence more robustness) in capturing these correlations. In the following, we look at linearizing the kernel in (7) to generate Higher-order Kernel (HOK) descriptors. The parameter $r$ captures the order statistics of the kernel, as will become clear in the next section; $\zeta_1$ and $\zeta_2$ are weights associated with the kernels, and we assume $\zeta_1, \zeta_2 > 0$ and $\zeta_1 + \zeta_2 = 1$. $\Lambda$ is the normalization constant associated with the kernel linearization (see (3)). Note that we assume all the sequences are of the same length $n$ in the kernel formulation. This is a mere technicality to simplify our derivations. As will be seen in the next section, our HOK descriptor depends only on the length of one sequence (see Definition 1 below).
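For concreteness, here is a direct (unlinearized) evaluation of the sequence kernel in Eq. (7), a sketch of ours in NumPy; the normalized timestamps and the choice $\Lambda = n^2$ follow Section 4.6, and the RBF form follows Eq. (2):

```python
import numpy as np

def sequence_kernel(P_i, P_j, sigma_F=0.1, sigma_T=0.1,
                    zeta1=0.5, zeta2=0.5, r=3):
    """Eq. (7): P_i, P_j are (n, M) arrays whose rows are the per-frame
    class confidence vectors P-hat(f_t) of the two sequences."""
    n = P_i.shape[0]
    t = np.arange(n) / max(n - 1, 1)      # timestamps normalized to [0, 1]
    K = 0.0
    for a in range(n):
        for b in range(n):
            psi1 = np.exp(-np.sum((P_i[a] - P_j[b]) ** 2)
                          / (2 * sigma_F ** 2))
            psi2 = np.exp(-(t[a] - t[b]) ** 2 / (2 * sigma_T ** 2))
            K += (zeta1 * psi1 + zeta2 * psi2) ** r
    return K / n ** 2                      # Lambda = n^2 (Section 4.6)
```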
4.3. Higher-order Kernel Descriptors

The following easily verifiable result [19] will be handy in understanding our derivations.

Proposition 1. Suppose $a, b \in \mathbb{R}^d$ are two arbitrary vectors. Then, for an ordinal $r > 0$,

$$\left( a^T b \right)^r = \left\langle a^{\otimes r},\, b^{\otimes r} \right\rangle. \qquad (8)$$

To simplify the notation, we write $x_t^i = \hat{P}(f_t^i)$ for the score vector of frame $f_t$ in $S_i$. Further, suppose we have sets of pivots $Z_F$ and $Z_T$ for the classifier scores and the time steps, respectively. Then, applying the kernel linearization in (5) to (7) using these pivots, we can rewrite each kernel as:

$$\psi_1\!\left( \frac{x_t^i - x_u^j}{\sigma_F} \right) \approx \sum_{z_F \in Z_F} \phi_F(x_t^i; z_F)\, \phi_F(x_u^j; z_F) = \Phi_F(x_t^i; Z_F)^T\, \Phi_F(x_u^j; Z_F), \qquad (9)$$

$$\psi_2\!\left( \frac{t - u}{\sigma_T} \right) \approx \sum_{z_T \in Z_T} \phi_T(t; z_T)\, \phi_T(u; z_T) = \Phi_T(t; Z_T)^T\, \Phi_T(u; Z_T). \qquad (10)$$

Substituting (9) and (10) into (7), we have:

$$K(S_i, S_j) \approx \frac{1}{\Lambda} \sum_{t,u=1}^{n} \left[ \zeta_1\, \Phi_F(x_t^i; Z_F)^T \Phi_F(x_u^j; Z_F) + \zeta_2\, \Phi_T(t; Z_T)^T \Phi_T(u; Z_T) \right]^r \qquad (11)$$

$$= \frac{1}{\Lambda} \sum_{t,u=1}^{n} \left\langle \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_t^i; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(t; Z_T) \end{bmatrix}^{\otimes r}, \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_u^j; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(u; Z_T) \end{bmatrix}^{\otimes r} \right\rangle, \qquad (12)$$

where we applied Proposition 1 to (11). As each component in the inner product in (12) depends only on its own temporal index, we can carry the summation inside the terms, leading to:

$$(12) \rightarrow \left\langle \frac{1}{\sqrt{\Lambda}} \sum_{t=1}^{n} \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_t^i; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(t; Z_T) \end{bmatrix}^{\otimes r}, \frac{1}{\sqrt{\Lambda}} \sum_{u=1}^{n} \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_u^j; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(u; Z_T) \end{bmatrix}^{\otimes r} \right\rangle. \qquad (13)$$

Using these derivations, we are now ready to formally define our Higher-order Kernel (HOK) descriptor.

Definition 1 (HOK). Let $X = \langle x_1, x_2, \ldots, x_n \rangle$, $x_i \in [0, 1]^d$, be the probability scores from $d$ classifiers for the $n$ frames in a sequence. Then we define the $r$-th order HOK descriptor for $X$ as:

$$\mathrm{HOK}(X) = \frac{1}{\sqrt{\Lambda}} \sum_{u=1}^{n} \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_u; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(u; Z_T) \end{bmatrix}^{\otimes r} \qquad (14)$$

for pivot sets $Z_F \subset \Delta_d$ and $Z_T \subset \mathbb{R}$ for the classifier scores and the temporal instances, respectively. Further, $\zeta_1, \zeta_2 \in [0, 1]$ such that $\zeta_1 + \zeta_2 = 1$, and $\Lambda$ is a suitable normalization.

Once the HOK tensor is generated for a sequence, we vectorize it for use in a linear classifier trained against action labels. As can be verified (see [19]), the HOK tensor is super-symmetric; thus, after removing the symmetric entries, the dimensionality of this descriptor is $\binom{|Z_F| + |Z_T| + r - 1}{r}$. In the sequel, we use $r = 3$ as a trade-off between performance and descriptor size. Figure 2 illustrates our overall HOK generation and classification framework.

Figure 2: Our overall Higher-order Kernel (HOK) descriptor generation and classification pipeline. PN stands for eigenvalue power normalization.

4.4. Power Normalization

It is often found that applying a non-linear operator to higher-order tensors leads to significantly better performance [20]. For example, for BOW, unit normalization is known to reduce the impact of background features, while taking the feature square-root reduces burstiness. Motivated by these observations, we may incorporate such non-linearities into the HOK descriptors as well. As these are higher-order tensors, we apply the following scheme based on the higher-order SVD (HOSVD) of the tensor [19, 17, 18]. Let $H$ denote $\mathrm{HOK}(X)$; then

$$[\Sigma, U_1, U_2, \ldots, U_r] = \mathrm{HOSVD}(H), \qquad (15)$$

$$\hat{H} = \mathrm{sign}(\Sigma)\, |\Sigma|^{\alpha} \times_{i=1}^{r} U_i, \qquad (16)$$

where the $U_i$ are the orthonormal matrices (which are all the same in our case) associated with the Tucker decomposition [23], and $\Sigma$ is the core tensor. Note that, unlike the usual SVD operation for matrices, the core tensor in HOSVD is generally not diagonal. Refer to the notation in Section 3 for the definition of $\times_{i=1}^{n}$. We use $\hat{H}$ for training the linear classifiers after vectorization. The power-normalization parameter $\alpha$ is selected via cross-validation.
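The two steps above compose as follows; this is our illustrative NumPy sketch of Eq. (14) for $r = 3$ together with the power normalization of Eqs. (15)-(16), exploiting the fact that a super-symmetric tensor has identical HOSVD factors across modes:

```python
import numpy as np

def hok(P, Z_F, sig_F, Z_T, sig_T, zeta1=0.5, zeta2=0.5):
    """Eq. (14) with r = 3. P: (n, d) per-frame score vectors;
    Z_F: (K_F, d) classifier pivots with bandwidths sig_F (K_F,);
    Z_T: (K_T,) temporal pivots sharing one bandwidth sig_T."""
    n = P.shape[0]
    t = np.arange(n) / max(n - 1, 1)
    H = 0.0
    for u in range(n):
        phi_F = np.exp(-np.sum((P[u] - Z_F) ** 2, axis=1) / sig_F ** 2)
        phi_T = np.exp(-(t[u] - Z_T) ** 2 / sig_T ** 2)
        v = np.concatenate([np.sqrt(zeta1) * phi_F, np.sqrt(zeta2) * phi_T])
        H = H + np.einsum('i,j,k->ijk', v, v, v)   # v^{(x)3}
    return H / n                                    # 1/sqrt(Lambda), Lambda = n^2

def power_normalize(H, alpha=0.1):
    """Eqs. (15)-(16): HOSVD via the mode-1 unfolding (all modes share U),
    raise the core tensor element-wise to power alpha, and reconstruct."""
    d = H.shape[0]
    U = np.linalg.svd(H.reshape(d, -1))[0]
    core = np.einsum('ijk,ia,jb,kc->abc', H, U, U, U)   # core tensor Sigma
    core = np.sign(core) * np.abs(core) ** alpha
    return np.einsum('abc,ia,jb,kc->ijk', core, U, U, U)
```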
4.5. Computational Complexity

For $M$ classes, $n$ frames per sequence, $|Z|$ pivots, and tensor order $r$, generating HOK takes $O(n(M + |Z|^r))$. As HOK is super-symmetric, using truncated SVD for a rank $k \ll |Z|$, HOSVD takes only $O(k^2 |Z|)$ time. See [18] for more details.

4.6. Learning HOK Parameters

An important step in using the HOK descriptor is to find appropriate pivot sets $Z_F$ and $Z_T$. Given that the temporal pivots are uni-dimensional, we select them to be equally spaced along the time axis after normalizing the temporal indexes to $[0, 1]$. For $Z_F$, which operates on the classifier scores, which can be high-dimensional, we propose to use an expectation-maximization (EM) algorithm. This choice is motivated by the fact that the entries of $\zeta_1 \Phi_F(x_u; Z_F)$ in (12) essentially compute a soft similarity between the classifier score vector of every frame and the pivots through a Gaussian kernel. Thus, modeling the problem in a soft-assignment setup using a Gaussian mixture model is natural; the parameters (the means and the variances) are learned using the EM algorithm, and these parameters are used as the pivot set. Other parameters in the model, such as $\zeta$, are computed using cross-validation. The normalization factor $\Lambda$ is chosen to be $n^2$, where $n$ is the sequence length.
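A minimal sketch of this pivot learning, assuming scikit-learn's GaussianMixture as the EM implementation (the paper does not name a specific library):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_classifier_pivots(scores, n_pivots=48, seed=0):
    """Fit a diagonal-covariance GMM to the pooled frame-level score
    vectors (num_frames, M); the means form Z_F and the square roots
    of the variances give the per-pivot bandwidths sigma_F."""
    gmm = GaussianMixture(n_components=n_pivots, covariance_type='diag',
                          random_state=seed).fit(scores)
    return gmm.means_, np.sqrt(gmm.covariances_)

def temporal_pivots(n_pivots=5):
    """Equally spaced pivots on the normalized [0, 1] time axis."""
    return np.linspace(0.0, 1.0, n_pivots)
```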
5. Experiments and Results

This section provides experimental evidence of the usefulness of our proposed pooling scheme for fine-grained action recognition. We verify this on two popular benchmark datasets for this task, namely (i) the MPII cooking activities dataset [28] and (ii) the JHMDB dataset [13]. Note that we use a VGG-16 model [5] for the two-stream architecture on both datasets, pre-trained on ImageNet for object recognition and fine-tuned on the respective action datasets.

5.1. Datasets

MPII Cooking Activities Dataset [28]: consists of high-resolution videos of cooking activities recorded by a stationary camera. The dataset consists of videos of people cooking various dishes; each video contains a single person cooking a dish, and overall there are 12 such videos in the dataset. There are 64 distinct activities spread across 3748 video clips and one background activity (1861 clips). The activities range from coarse subject motions such as moving from X to Y, opening refrigerator, etc., to fine-grained actions such as peel, slice, cut apart, etc.

JHMDB Dataset [13]: is a subset of the larger HMDB dataset [22], but contains videos in which the human limbs are clearly visible. The dataset contains 21 action categories such as brush hair, pick, pour, push, etc. Unlike the MPII cooking activities dataset, the videos in the JHMDB dataset are of low resolution. Each video clip is a few seconds long. There are a total of 968 videos, which are mostly downloaded from the internet.

5.2. Evaluation Protocols

We follow the standard protocols suggested in the original publications that introduced these datasets. Thus, we use the mean average precision (mAP) over 7-fold cross-validation for the MPII dataset, while we use the mean average accuracy over 3-fold cross-validation for the JHMDB dataset. For the former, we use the evaluation code published with the dataset.

5.3. Preprocessing

As the original MPII cooking videos are of very high resolution, while the activities happen only in certain parts of the scene, we used a frame-difference scheme to estimate a window of the scene that localizes the action. Precisely, for every sequence, we first resize the frames to half their size, followed by frame differencing, dilation, smoothing, and connected component analysis. This results in a binary image for every frame; these are then combined across the sequence, and a binary mask is generated for the entire sequence. We use the largest bounding box containing all the connected components in this binary mask as the region of the action, and crop the video to this box. The cropped frames are then resized to 224×224 and used to train the VGG networks. To compute optical flow, we used the Brox implementation [3]. Each flow image is rescaled to 0–255 and saved as a JPEG image for storage efficiency, as described in [31]. For the JHMDB dataset, the frames are already of low resolution; thus, we use them directly in the CNN after resizing to the expected input sizes.
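The following OpenCV sketch outlines this localization step; the threshold, kernel sizes, and exact ordering of the morphological operations are our assumptions, as the paper does not specify them:

```python
import cv2
import numpy as np

def action_bounding_box(frames):
    """Accumulate frame-difference masks over a sequence (at half
    resolution) and return one box covering all moving regions."""
    gray = [cv2.cvtColor(cv2.resize(f, None, fx=0.5, fy=0.5),
                         cv2.COLOR_BGR2GRAY) for f in frames]
    mask = np.zeros_like(gray[0])
    kernel = np.ones((5, 5), np.uint8)
    for prev, cur in zip(gray, gray[1:]):
        diff = cv2.absdiff(cur, prev)                    # frame differencing
        diff = cv2.dilate(diff, kernel)                  # dilation
        diff = cv2.GaussianBlur(diff, (5, 5), 0)         # smoothing
        _, binary = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        mask = cv2.bitwise_or(mask, binary)              # sequence-level mask
    ys, xs = np.nonzero(mask)
    if xs.size == 0:                                     # no motion found
        return 0, 0, mask.shape[1], mask.shape[0]
    return xs.min(), ys.min(), xs.max(), ys.max()        # box at half scale
```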
5.4. CNN Training

The two streams of the CNN are trained separately on their respective input modalities against a softmax cross-entropy loss. We used the sequences from the training set of the MPII cooking activities dataset for training the CNNs (1992 sequences) and used those from the provided validation set (615 sequences) to check for overfitting. For JHMDB, we used 50% of the training set for fine-tuning the CNNs, of which 10% was used as the validation set. We augmented the datasets using random crops, flips, and slight rotations of the frames. While fine-tuning the CNNs (from a pre-trained ImageNet model), we used a fixed learning rate of $10^{-4}$ and an input batch size of 50 frames. The CNN training was stopped as soon as the loss on the validation set started increasing, which happened at about 6K iterations for the appearance stream and 40K iterations for the flow stream.

5.5. HOK Parameter Learning

As is clear from Definition 1, there are a few hyper-parameters associated with the HOK descriptor. In this section, we systematically analyze the influence of these parameters on the overall classification performance of the descriptor. To this end, we use split1 of the JHMDB dataset. Specifically, we explore the effect of changes to (i) the number of classifier pivots $Z_F$, (ii) the number of temporal pivots $Z_T$, and (iii) the power-normalization factor $\alpha$. In Figure 3, we plot the classifier accuracy for each of these cases. Each experiment is repeated 5 times with different initializations (for the GMM), and the average accuracy is used for the plots.

Figure 3: Analysis of the influence of various hyper-parameters on the action recognition accuracy. The numbers are computed on split1 of the JHMDB dataset, which consists of 21 ground-truth action classes.

Table 1: An analysis of per-class action recognition accuracy when using average pooling and HOK pooling (the top classes corrected by HOK pooling).

Action                   Avg. Pool mAP (%)   HOK Pool mAP (%)
Change temperature       15.1                57.5
Dry                      27.7                50.2
Fill water from tap      10.5                40.6
Open/close drawer        25.2                65.1

For (i), we fixed the number of temporal pivots to 5, with values [0, 0.25, 0.5, 0.75, 1], and fixed $\sigma_T = 0.1$. The classifier pivots $Z_F$ and their standard deviations $\sigma_F$ are found by learning GMM models with the prescribed number of Gaussian components. The means and the diagonal variances from this learned model are then used as the pivot set and its variances, respectively. As is clear from Figure 3(a), as the number of classifier pivots increases, the accuracy increases as well. However, beyond a certain number, the accuracy starts dropping. This is perhaps due to the sequences not containing a sufficient number of frames to account for larger models. Note that the JHMDB sequences contain about 30–40 frames per sequence. We also note that the accuracy of Flow+RGB is significantly higher than either stream alone.

For (ii), we fixed the number of classifier pivots at 48 (the best found from Figure 3(a)) and varied the number of temporal pivots from 1 to 30 in steps of 5. Similar to the classifier pivots, we find that increasing the number of temporal pivots is beneficial. Further, a larger $\sigma_T$ leads to a drop in accuracy, which implies that the ordering of the probabilistic scores does play a role in recognizing the activity.

For (iii), we fixed the number of classifier pivots at 48 and the number of temporal pivots at 5 (as described for (i) above). We varied $\alpha$ from 0.1 to 1 in steps of 0.1. We find that $\alpha$ closer to 0 is more promising, implying that there is a significant influence of burstiness in the sequences. That is, suppressing the larger probabilistic co-occurrences in the tensor more strongly than the weaker ones leads to better performance.

5.6. Results

In this section, we provide full experimental comparisons on the two datasets. Our main goal is to analyze the usefulness of higher-order pooling for action recognition. To this end, in Table 2, we show the performance differences between using (i) the first-order statistics, (ii) the second-order statistics, and (iii) our proposed third-order statistics. For (i), we average the classifier soft-max scores, as is usually done in late pooling [31]. For (ii), we use the second-order kernel matrix without pivoting. Specifically, for every sequence, let $x_{i:}$ and $x_{j:}$ denote the temporal evolution of the probabilistic scores for classifiers $i$ and $j$, respectively. Then we compute a kernel matrix $K(x_{i:}, x_{j:}) = e^{-\sigma \|x_{i:} - x_{j:}\|_2^2}$. As this matrix is a positive definite object, we use its log-Euclidean map (that is, the matrix logarithm, which is the asymptotic limit of $\alpha \rightarrow 0$ in power normalization) to embed it in Euclidean space [10]. This vector is then used for training. As is clear, this matrix captures the second-order statistics of the actions. For (iii), we use the proposed HOK descriptor as described in Definition 1. As is clear from Table 2, higher-order statistics lead to significant benefits on both datasets, for both input modalities (flow and RGB), and for their combinations.
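A sketch of this second-order baseline (ours; the small jitter added before the matrix logarithm is a numerical safeguard, not part of the paper's description):

```python
import numpy as np
from scipy.linalg import logm

def second_order_descriptor(P, sigma=0.1, eps=1e-6):
    """P: (n, M) frame-by-class scores. Build the RBF kernel matrix
    between the temporal score evolutions of every classifier pair,
    then embed it via the log-Euclidean map."""
    X = P.T                                      # row i = evolution of class i
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sigma * sq)                      # positive definite Gram matrix
    L = logm(K + eps * np.eye(K.shape[0]))       # matrix logarithm
    return np.real(L[np.triu_indices_from(L)])   # vectorized upper triangle
```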
Table 2: Evaluation of the HOK descriptor on the output of each CNN stream and their fusion on the MPII (7 splits) and JHMDB (3 splits) datasets. We also show the accuracy obtained via second-order pooling, which uses the kernel matrix directly without linearization (see text for details).

Experiment                   MPII mAP (%)   JHMDB Mean Avg. Acc. (%)
RGB (avg. pool)              33.9           51.5
Flow (avg. pool)             37.6           54.8
RGB+Flow (avg. pool)         38.1           55.9
RGB (second-order)           56.1           52.3
Flow (second-order)          61.3           60.4
RGB+Flow (second-order)      67.0           63.4
RGB (HOK)                    47.8           52.3
Flow (HOK)                   55.4           58.2
RGB+Flow (HOK)               60.6           64.7

5.7. Comparisons to the State of the Art

In Tables 3 and 4, we compare HOK descriptors to the state-of-the-art results on these two datasets. In this case, we combine the HOK descriptors from both the RGB and flow streams of the CNN. For the MPII dataset, we use 32 pivots for the classifier scores and 5 equispaced pivots for the time steps, with $\sigma_T = 0.1$. For the second-order tensor, we use $\sigma = 0.1$ for both datasets. We use the same setup for the JHMDB dataset, except that we use 48 pivots. The power-normalization factor is set to 0.1. As is clear, although HOK by itself is not superior to the other methods, when the second- and third-order statistics are combined (stacking their values into a vector), it demonstrates significant promise. For example, we see an improvement of 5–6% over the recent method in [6], which also uses a CNN. Further, we find that when the higher-order statistics are combined with trajectory features, there is a further improvement in accuracy, which results in a model that outperforms the state of the art.

Table 3: MPII Cooking Activities dataset (7 splits).

Algorithm                                  mAP (%)
Holistic+Pose, CVPR'12                     57.9
Video Darwin, CVPR'15                      72.0
Interaction Part Mining, CVPR'15           72.4
P-CNN, ICCV'15                             62.3
P-CNN+IDT-FV, ICCV'15                      71.4
Semantic Features, CVPR'15                 70.5
Hierarchical Mid-Level Actions, ICCV'15    66.8
HOK (ours)                                 60.6
HOK+second-order (ours)                    69.1
HOK+second-order+Trajectories              73.1

Table 4: JHMDB dataset (3 splits).

Algorithm                          Avg. Acc. (%)
P-CNN, ICCV'15                     61.1
P-CNN+IDT-FV, ICCV'15              72.2
Action Tubes, CVPR'15              62.5
Stacked Fisher Vectors, ECCV'14    69.03
IDT+FV, ICCV'13                    62.8
HOK (ours)                         64.7
HOK+second-order (ours)            66.8
HOK+second-order+IDT-FV            73.3

5.8. Analysis of Results

To gain insight into the performance benefits noted above, we conducted an analysis of the results on the MPII dataset. Table 1 lists activities that are confused under average pooling but corrected by HOK. Specifically, we find that activities such as fill water from tap and open/close drawer, which are originally confused with wash objects and take out from drawer, get corrected using higher-order pooling. Note that these activities are inherently ambiguous unless context and sub-actions are analyzed. This shows that our descriptor can effectively represent useful cues for recognition.

In Table 2 (column 1), we see that the second-order tensor performs significantly better than HOK on the MPII dataset. We suspect this surprising behavior is due to the highly unbalanced number of frames per sequence in this dataset. For example, for classes such as pull-out, pour, etc., which have only about 7 clips each of 50–90 frames, the second-order descriptor is about 30% better than HOK in mAP, while for classes such as take spice holder, having more than 25 videos with 50–150 frames, HOK is about 10% better than the second-order descriptor. This suggests that the poor performance is perhaps due to unreliable estimation of the data statistics, and that second- and third-order statistics provide complementary cues, as also witnessed in Table 3. For the JHMDB dataset, there are about 30 frames in all sequences, and thus the statistics are more consistent. Another reason could be that unbalanced sequences may bias the GMM parameters, which form the pivots, towards classes that have more frames.

6. Conclusion

In this paper, we presented a technique for higher-order pooling of CNN scores for the task of action recognition in videos. We showed how to use the idea of kernel linearization to generate a higher-order kernel descriptor, which can capture latent relationships between the CNN classifier scores. Our experimental analysis on two standard fine-grained action datasets clearly demonstrates that using higher-order relationships is beneficial for the task and leads to state-of-the-art performance.

Acknowledgements: This research was supported by the Australian Research Council (ARC) through the Centre of Excellence for Robotic Vision (CE140100016).

References

[1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39, 2011.
[2] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In NIPS, 2009.
[3] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. PAMI, 33(3):500–513, 2011.
[4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[6] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In ICCV, 2015.
[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[8] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
[9] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. PAMI, 33(11):2188–2202, 2011.
[10] K. Guo, P. Ishwar, and J. Konrad. Action recognition from video using feature covariance matrices. TIP, 2013.
[11] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
[12] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. JMLR, 5:819–844, 2004.
[13] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
[14] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
[15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[16] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[17] P. Koniusz and A. Cherian. Sparse coding for third-order super-symmetric tensor descriptors with application to texture recognition. In CVPR, 2016.
[18] P. Koniusz, A. Cherian, and F. Porikli. Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV, 2016.
[19] P. Koniusz, F. Yan, P. Gosselin, and K. Mikolajczyk. Higher-order occurrence pooling for bags-of-words: Visual concept detection. PAMI, 2016.
[20] P. Koniusz, F. Yan, and K. Mikolajczyk. Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection. CVIU, 2012.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[23] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Analysis and Applications, 21:1253–1278, 2000.
[24] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In NIPS, 2014.
[25] B. Ni, V. R. Paramathayalan, and P. Moulin. Multiple granularity analysis for fine-grained action detection. In CVPR, 2014.
[26] L. Pishchulin, M. Andriluka, and B. Schiele. Fine-grained activity recognition with holistic and pose based features. In Pattern Recognition, pages 678–689. Springer, 2014.
[27] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[28] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In CVPR, 2012.
[29] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele. Recognizing fine-grained and composite activities using hand-centric features and script data. IJCV, 119(3):346–373, 2016.
[30] E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR, 2005.
[31] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[32] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[33] M. A. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In ECCV, 2002.
[34] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In CVPR, 2013.
[35] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
[36] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[37] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[38] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[39] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[40] Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian. Interaction part mining: A mid-level approach for fine-grained action recognition. In CVPR, 2015.
[41] S. Zuffi and M. J. Black. Puppet flow. IJCV, 101(3):437–458, 2013.