Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition

Anoop Cherian (1,3)   Piotr Koniusz (2,3)   Stephen Gould (1,3)
(1) Australian Centre for Robotic Vision   (2) Data61/CSIRO   (3) The Australian National University, Canberra, Australia
anoop.cherian@anu.edu.au   piotr.koniusz@data61.csiro.au   stephen.gould@anu.edu.au

Abstract

Most successful deep learning algorithms for action recognition extend models designed for image-based tasks, such as object recognition, to video. Such extensions are typically trained for actions on single video frames or very short clips, and their predictions from sliding windows over the video sequence are then pooled for recognizing the action at the sequence level. Usually this pooling step uses only the first-order statistics of the frame-level action predictions. In this paper, we explore the advantages of using higher-order correlations; specifically, we introduce Higher-order Kernel (HOK) descriptors generated from the late fusion of CNN classifier scores from all the frames in a sequence. To generate these descriptors, we use the idea of kernel linearization. Specifically, a similarity kernel matrix, which captures the temporal evolution of deep classifier scores, is first linearized into kernel feature maps. The HOK descriptors are then generated from the higher-order co-occurrences of these feature maps, and are used as input to a video-level classifier. We provide experiments on two fine-grained action recognition datasets and show that our scheme leads to state-of-the-art results.

[Figure 1: Fine-grained action instances from two different action categories: cut-in (left) and slicing (right). These instances are from the MPII cooking activities dataset [28].]

1. Introduction

With the resurgence of efficient deep learning algorithms, recent years have seen significant advancements in several fundamental problems in computer vision, including human action recognition in videos [31, 32, 15, 7]. However, despite this progress, solutions for action recognition are far from being practically useful, especially in real-world settings. This is because real-world actions are often subtle, may use hard-to-detect tools (such as knives, peelers, etc.), may involve strong appearance or human pose variations, may be of different durations, or may be performed at different speeds, among several other complicating factors. In this paper, we study a difficult subset of the general problem of action recognition, namely fine-grained action recognition. This problem category is characterized by actions that have very weak intra-class appearance similarity (such as cutting tomatoes versus cutting cucumbers) but strong inter-class similarity (peeling cucumbers versus slicing cucumbers). Figure 1 illustrates two fine-grained actions.

Unsurprisingly, the most recent trend for fine-grained activity recognition is based on convolutional neural networks (CNNs) [6, 14]. These schemes extend frameworks developed for general-purpose action recognition [31] to the fine-grained setting by incorporating heuristics that extract auxiliary discriminative cues, such as the position of body parts or indicators of human-to-object interaction. A technical difficulty when extending CNN-based methods to videos is that, unlike objects in images, the actions in videos are spread across several frames. Thus, to correctly infer actions, a CNN must be trained on the entire video sequence. However, current computational architectures are prohibitive in using more than a few tens of frames, thus limiting the size of the temporal receptive fields. For example, the two-stream model [31] uses single frames or tiny sets of optical flow images for learning the actions. One may overcome this difficulty by switching to recurrent networks [1], which typically require large training sets for effective learning.
As is clear, using single frames to train CNNs might be insufficient to capture the dynamics of actions effectively, while a large stack of frames requires a larger number of CNN parameters that result in model overfitting, thereby demanding larger training sets and computational resources. This problem also exists in other popular CNN architectures such as 3D CNNs [32, 14]. Thus, state-of-the-art deep action recognition models are usually trained to generate useful features from short video clips, which are then pooled to generate holistic sequence-level descriptors used to train a linear classifier on action labels. For example, in the two-stream model [31], the soft-max scores from the final CNN layers of the RGB and optical flow streams are combined using average pooling. Note that average pooling captures only the first-order correlations between the scores; a higher-order pooling [19] that captures higher-order correlations between the CNN features can be more appropriate, which is the main motivation for the scheme proposed in this paper.

Specifically, we assume a two-stream CNN framework as suggested in [31], with separate RGB image and optical flow streams. Each of these streams is trained on single RGB or optical flow frames from the sequences against the action labels. As noted above, while CNN classifier predictions at the frame level might be very noisy, we posit that the correlations between the temporal evolutions of classifier scores can capture useful action cues which may help improve recognition performance. Intuitively, some of the actions might have unlabelled pre-cursors (such as Walking, Picking, etc.). Using higher-order action occurrences in a sequence may be able to capture such pre-cursors, leading to better discrimination of the action, whereas they may be discarded as noise by a first-order pooling scheme. In this paper, we use this intuition to develop a theoretical framework for action recognition using higher-order pooling on the two-stream classifier scores.

Our pooling scheme is based on kernel linearization—a simple technique that decomposes a similarity kernel matrix computed on the input data in terms of a set of anchor points (pivots). Using a kernel (specifically, a Gaussian kernel) offers richer representational power, e.g., by embedding the data in an infinite-dimensional Hilbert space. The use of the kernel also allows us to easily incorporate additional prior knowledge about the actions. For example, we show in Section 4 how to impose a soft temporal grammar on the actions in a sequence. Kernels are a class of positive definite objects that belong to a non-linear (Riemannian) geometry, and thus directly evaluating them for use in a dual SVM (c.f. primal SVM) is computationally expensive. Therefore, these kernels are usually linearized into feature maps so that fast linear classifiers can be applied instead. In this paper, we use a linearization technique developed for shift-invariant kernels (Section 3). Using this technique, we propose Higher-order Kernel descriptors (Section 4.3) that capture the higher-order co-occurrences of the kernel maps against the pivots. We apply a non-linear operator (such as eigenvalue power normalization [19]) to the HOK descriptors, as it is known to lead to superior classification performance. The HOK descriptor is then vectorized and used for training on actions. Note that the HOK descriptors that we propose here belong to the more general family of third-order super-symmetric tensor (TOSST) descriptors introduced in [17, 18].

We provide experiments (Section 5) using the proposed scheme on two standard action datasets, namely (i) the MPII Cooking activities dataset [28], and (ii) the JHMDB dataset [13]. Our results demonstrate that higher-order pooling is useful and can perform competitively to state-of-the-art methods.

2. Related Work

Initial approaches to tackle fine-grained action recognition have been direct extensions of schemes developed for general classification problems, mainly based on hand-crafted features. A few notable such approaches include [26, 28, 29, 35], in which features such as HOG, SIFT, HOF, etc., are first extracted at spatio-temporal interest point locations (e.g., following dense trajectories) and then fused and used to train a classifier. However, the recent trend has moved towards data-driven feature learning via deep learning platforms [21, 31, 14, 32, 7, 39]. As alluded to earlier, the lack of sufficient annotated video data, and the need for expensive computational infrastructure, make direct extension of these frameworks (which are primarily developed for image recognition tasks) challenging for video data, thus demanding efficient representations.

Another promising setup for fine-grained action recognition has been to use mid-level features such as human pose. Clearly, estimating the human pose and developing action recognition systems on it disentangles the action inference from operating directly on pixels, thus allowing for a higher level of sophisticated action reasoning [28, 34, 41, 6]. Although there have been significant advancements in pose estimation recently [37, 11], most of these models are computationally demanding and thus difficult to scale to the millions of video frames that form standard datasets.

A different approach to fine-grained recognition is to detect and analyze human-object interactions in the videos. Such a technique is proposed in Zhou et al. [40]. Their method starts by generating region proposals for human-object interactions in the scene, extracts visual features from these regions, and trains a classifier for action classes on these features. A scheme based on tracking human hands and their interactions with objects is presented in Ni et al. [25].
Hough forests for action recognition are proposed in Gall et al. [9]. Although recognizing objects may be useful, they may not be easily detectable in the context of fine-grained actions.

We also note that there have been several other deep learning models devised for action modeling, such as those using 3D convolutional filters [14], recurrent neural networks [1], long short-term memory networks [7], and large-scale video classification architectures [15]. These models demand huge collections of videos for effective training, which are usually unavailable for fine-grained activity tasks, and thus the applicability of these models is yet to be ascertained.

Pooling has been a useful technique for reducing the size of video representations, thereby enabling the applicability of efficient machine learning algorithms to this data modality. Recently, a pooling scheme preserving the temporal order of the frames was proposed by Fernando et al. [8] by solving a Rank-SVM formulation. In Wang et al. [36], deep features are fused along action trajectories in the video. Correlations between space-time features are proposed in [30]. Early and late fusion of CNN feature maps for action recognition are proposed in [15, 39]. Our proposed higher-order pooling scheme is somewhat similar to the second- and higher-order pooling approaches proposed in [4] and [19], which generate representations from low-level descriptors for the tasks of semantic segmentation of images and object category recognition, respectively. Moreover, our HOK descriptor is inspired by the sequence compatibility kernel (SCK) descriptor introduced in [18], which pools higher-order occurrences of feature maps from skeletal body joints for action recognition. In contrast, we use the frame-level prediction vectors (outputs of the fc8 layers) from the deep classifiers to generate our pooled descriptors; therefore, the size of our pooled descriptors is a function of the number of action classes. Moreover, unlike SCK, which uses pose skeletons, we use raw action videos directly. Our work is also different from works such as [17, 33], in which tensor descriptors are built on hand-crafted features. In this paper, we show how CNNs can benefit from higher-order pooling for the application of fine-grained action recognition.

3. Background

In this section, we first review the tensor notation that we use in the following sections. This precedes an introduction to the necessary theory behind kernel linearization, on which our descriptor framework is based.

3.1. Notation

Let $a \in \mathbb{R}^d$ be a $d$-dimensional vector. Then, we use $\mathcal{A} = a^{\otimes r}$ to denote the $r$-mode super-symmetric tensor generated by the $r$-th order outer product of $a$, where the element at the $(i_1, i_2, \ldots, i_r)$-th index is given by $\prod_{j=1}^{r} a_{i_j}$. We use the notation $\mathcal{A} = \mathcal{S} \times_{k=1}^{n} P_k$ for matrices $P_k$ and a tensor $\mathcal{S}$ to denote that

$$\mathcal{A}_{(i_1, i_2, \ldots, i_n)} = \sum_{j_1} \sum_{j_2} \cdots \sum_{j_n} \mathcal{S}_{(j_1, j_2, \ldots, j_n)}\, P_1(i_1, j_1)\, P_2(i_2, j_2) \cdots P_n(i_n, j_n).$$

This notation arises in the Tucker decomposition of higher-order tensors; see [23, 16] for details. We note that the inner product between two such tensors follows the usual element-wise product and summation, as is typically used in linear algebra. We assume in the sequel that the order $r$ is ordinal and greater than zero. We use the notation $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product and $\Delta^d$ for the $d$-dimensional simplex.
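The tensor operations above are straightforward to realize in code. The following minimal NumPy sketch (our illustration, not part of the original text) builds the super-symmetric outer product $a^{\otimes r}$ and evaluates the element-wise inner product between two such tensors; it also verifies the identity $(a^T b)^r = \langle a^{\otimes r}, b^{\otimes r} \rangle$, which reappears later as Proposition 1.

```python
import numpy as np

def outer_power(a, r):
    """r-th order super-symmetric outer product a^{(x)r}."""
    t = a
    for _ in range(r - 1):
        t = np.multiply.outer(t, a)  # repeated tensor outer product
    return t

def tensor_inner(A, B):
    """Element-wise product and summation, <A, B>."""
    return np.sum(A * B)

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.1, 0.7, 0.2])
r = 3
lhs = tensor_inner(outer_power(a, r), outer_power(b, r))
rhs = float(a @ b) ** r
print(np.isclose(lhs, rhs))  # True: (a^T b)^r = <a^{(x)r}, b^{(x)r}>
```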
3.2. Kernel Linearization

Let $X = \{x_t \in \mathbb{R}^n : t = 1, 2, \ldots, T\}$ be a set of data instances produced by some dynamic process at discrete time instances $t$. Let $K$ be a kernel matrix created from $X$, the $ij$-th element of which is

$$K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle, \qquad (1)$$

where $\phi(\cdot)$ represents some kernelized similarity map.

Theorem 1 (Linear Expansion of the Gaussian Kernel). For all $x$ and $x'$ in $\mathbb{R}^n$ and $\sigma > 0$,

$$\psi\!\left(\frac{x - x'}{\sigma}\right) = e^{-\frac{1}{2\sigma^2}\|x - x'\|_2^2} \qquad (2)$$

$$= \left(\frac{2}{\pi\sigma^2}\right)^{\frac{n}{2}} \int_{z \in \mathbb{R}^n} e^{-\frac{1}{\sigma^2}\|x - z\|_2^2}\, e^{-\frac{1}{\sigma^2}\|x' - z\|_2^2}\, dz. \qquad (3)$$

Proof. See [12, 24].

We can approximate the linearized kernel by choosing a finite set of pivots $z \in Z = \{z_1, z_2, \ldots, z_K; \sigma_1, \sigma_2, \ldots, \sigma_K\}$.(1) Using $Z$, we can rewrite (3) as

$$\psi(x - x') \approx \langle \tilde{\phi}(x; Z), \tilde{\phi}(x'; Z) \rangle, \qquad (4)$$

where

$$\tilde{\phi}(x; Z) = \left( e^{-\frac{1}{\sigma_1^2}\|x - z_1\|_2^2}, \ldots, e^{-\frac{1}{\sigma_K^2}\|x - z_K\|_2^2} \right). \qquad (5)$$

We call $\tilde{\phi}(x; Z)$ the approximate feature map for the input data point $x$.

In the sequel, we linearize the constituent kernels, as this allows us to obtain linear maps, which results in favourable computational complexities; i.e., we avoid computing a kernel explicitly between tens of thousands of video sequences [18]. Also, see [2, 38] for connections of our method with the Nyström approximation and random Fourier features [27].

(1) To simplify the notation, we assume that the pivot set includes the bandwidths $\sigma$ associated with each pivot as well.
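As an illustration of Theorem 1 and Eqs. (4)-(5), the sketch below computes the approximate feature map $\tilde{\phi}(x; Z)$ for a hypothetical pivot set. The pivot placement and bandwidths here are arbitrary stand-ins, and the inner product recovers the Gaussian kernel only up to the normalization constant of Eq. (3), which the later derivations absorb into $\Lambda$.

```python
import numpy as np

def feature_map(x, pivots, sigmas):
    """Approximate Gaussian feature map of Eq. (5):
    one RBF response per pivot z_k with bandwidth sigma_k."""
    return np.array([np.exp(-np.sum((x - z) ** 2) / s ** 2)
                     for z, s in zip(pivots, sigmas)])

rng = np.random.default_rng(0)
x, xp = rng.random(4), rng.random(4)
pivots = rng.random((64, 4))        # hypothetical pivot set Z
sigmas = np.full(64, 0.5)

exact = np.exp(-np.sum((x - xp) ** 2) / (2 * 0.5 ** 2))   # Eq. (2)
approx = feature_map(x, pivots, sigmas) @ feature_map(xp, pivots, sigmas)
# 'approx' matches 'exact' only up to the constant in Eq. (3) and the
# density of the pivot set; both are absorbed/tuned in later sections.
print(exact, approx)
```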
4. Proposed Method

In this section, we first describe our overall CNN framework, based on the popular two-stream deep model for action recognition [31]. This precedes an exposition of our higher-order pooling framework, in which we introduce appropriate kernels for the task at hand. We also describe a setup for learning the parameters of the kernel maps.

4.1. Problem Formulation

Let $\mathcal{S} = \{S_1, S_2, \ldots, S_N\}$ be a set of video sequences, each sequence belonging to one of $M$ action classes with labels from the set $\mathcal{L} = \{\ell_1, \ell_2, \ldots, \ell_M\}$. Let $S = \langle f_1, f_2, \ldots, f_n \rangle$ be a sequence of frames of $S \in \mathcal{S}$, and let $\mathcal{F} = \bigcup_{S \in \mathcal{S}} \{f_i \mid f_i \in S\}$. In the action recognition setup, our goal is to find a mapping from any given sequence to its ground-truth label. Assume we have trained frame-level action classifiers for each class and that these classifiers cannot see all the frames in a sequence together. Suppose $P_m : \mathcal{F} \to [0, 1]$ is one such classifier, trained to produce a confidence score for an input frame to belong to the $m$-th class. Since a single frame is unlikely to represent well the entire sequence, the classifier $P_m$ is inaccurate at determining the action at the sequence level. However, our hypothesis is that a combination of the predictions from all the classifiers across all the frames in a sequence could capture discriminative properties of the action and could improve recognition. In the sequel, we explore this possibility in the context of higher-order tensor descriptors.

4.2. Correlations between Classifier Predictions

Using the notation defined above, let $\langle f_1, f_2, \ldots, f_n \rangle$, $f_i \in S$, denote a sequence of frames, and let $P_m(f_i)$ denote the probability that a classifier trained for the $m$-th action class predicts $f_i$ to belong to class $\ell_m$. Then,

$$\hat{P}(f_i) = \left[ P_1(f_i), P_2(f_i), \ldots, P_M(f_i) \right]^T \qquad (6)$$

denotes the class confidence vector for frame $i$. As described earlier, we assume that there exists a strong correlation between the confidences of the classifiers across time (the temporal evolution of the classifier scores) for frames from similar sequences; i.e., frames that are confused between different classifiers should be confused in a similar way for different sequences. To capture these correlations between the classifier predictions, we propose to use a kernel formulation on the scores from sequences; the $ij$-th entry of this kernel is

$$K(S_i, S_j) = \frac{1}{\Lambda} \sum_{t,u=1}^{n} \left[ \zeta_1\, \psi_1\!\left(\frac{\hat{P}(f_t) - \hat{P}(f_u)}{\sigma_F}\right) + \zeta_2\, \psi_2\!\left(\frac{t - u}{\sigma_T}\right) \right]^r, \qquad (7)$$

where $\psi_1$ and $\psi_2$ are two RBF kernels. The kernel $\psi_1$ captures the similarity between the two classifier scores at time steps $t$ and $u$, while the kernel $\psi_2$ puts a smoothing on the length of the interval $[t, u]$. A small bandwidth $\sigma_T$ will demand that the two classifier scores be strongly correlated at the respective time instances, while a larger $\sigma_T$ allows some variance (and hence more robustness) in capturing these correlations. In the following, we look at linearizing the kernel in (7) for generating Higher-order Kernel (HOK) descriptors. The parameter $r$ captures the order statistics of the kernel, as will be clear in the next section; $\zeta_1$ and $\zeta_2$ are weights associated with the kernels, and we assume $\zeta_1, \zeta_2 > 0$, $\zeta_1 + \zeta_2 = 1$. $\Lambda$ is the normalization constant associated with the kernel linearization (see (3)). Note that we assume all the sequences are of the same length $n$ in the kernel formulation. This is a mere technicality to simplify our derivations; as will be seen in the next section, our HOK descriptor depends only on the length of one sequence (see Definition 1 below).
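For concreteness, the following unoptimized sketch evaluates Eq. (7) for a pair of sequences. The bandwidths, weights, and order used here are placeholder values rather than the cross-validated ones reported in Section 5, and we already adopt the choices introduced in Section 4.6 ($\Lambda = n^2$ and time stamps normalized to $[0,1]$).

```python
import numpy as np

def sequence_kernel(P_i, P_j, sigma_F=0.5, sigma_T=0.1,
                    zeta1=0.5, zeta2=0.5, r=3):
    """Illustrative evaluation of Eq. (7).
    P_i, P_j: (n, M) arrays of per-frame class confidence vectors."""
    n = P_i.shape[0]
    times = np.arange(n) / max(n - 1, 1)   # normalized time stamps
    total = 0.0
    for t in range(n):
        for u in range(n):
            psi1 = np.exp(-np.sum((P_i[t] - P_j[u]) ** 2)
                          / (2 * sigma_F ** 2))
            psi2 = np.exp(-(times[t] - times[u]) ** 2
                          / (2 * sigma_T ** 2))
            total += (zeta1 * psi1 + zeta2 * psi2) ** r
    return total / n ** 2                  # Lambda = n^2 (Section 4.6)

rng = np.random.default_rng(1)
Pi, Pj = rng.random((10, 21)), rng.random((10, 21))
print(sequence_kernel(Pi, Pj))
```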
4.3. Higher-order Kernel Descriptors

The following easily verifiable result [19] will be handy in understanding our derivations.

Proposition 1. Suppose $a, b \in \mathbb{R}^d$ are two arbitrary vectors. Then, for an ordinal $r > 0$,

$$\left(a^T b\right)^r = \left\langle a^{\otimes r}, b^{\otimes r} \right\rangle. \qquad (8)$$

To simplify the notation, we write $x_t^i = \hat{P}(f_t)$ for the score vector of frame $f_t$ in $S_i$. Further, suppose we have sets of pivots $Z_F$ and $Z_T$ for the classifier scores and the time steps, respectively. Then, applying the kernel linearization in (5) to (7) using these pivots, we can rewrite each kernel as:

$$\psi_1\!\left(\frac{x_t^i - x_u^j}{\sigma_F}\right) \approx \sum_{z_F \in Z_F} \phi_F(x_t^i; z_F)\, \phi_F(x_u^j; z_F) = \Phi_F(x_t^i; Z_F)^T\, \Phi_F(x_u^j; Z_F), \qquad (9)$$

$$\psi_2\!\left(\frac{t - u}{\sigma_T}\right) \approx \sum_{z_T \in Z_T} \phi_T(t; z_T)\, \phi_T(u; z_T) = \Phi_T(t; Z_T)^T\, \Phi_T(u; Z_T). \qquad (10)$$

Substituting (9) and (10) into (7), we have:

$$K(S_i, S_j) \approx \frac{1}{\Lambda} \sum_{t,u=1}^{n} \left[ \zeta_1\, \Phi_F(x_t^i; Z_F)^T \Phi_F(x_u^j; Z_F) + \zeta_2\, \Phi_T(t; Z_T)^T \Phi_T(u; Z_T) \right]^r \qquad (11)$$

$$= \frac{1}{\Lambda} \sum_{t,u=1}^{n} \left\langle \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_t^i; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(t; Z_T) \end{bmatrix}^{\otimes r}\!, \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_u^j; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(u; Z_T) \end{bmatrix}^{\otimes r} \right\rangle, \qquad (12)$$
[Figure 2: Our overall Higher-order Kernel (HOK) descriptor generation and classification pipeline. PN stands for eigenvalue power normalization.]

where we applied Proposition 1 to (11). As the two components in the inner product in (12) depend on separate temporal indexes, we can carry the summation inside the terms, leading to:

$$(12) \to \left\langle \frac{1}{\sqrt{\Lambda}} \sum_{t=1}^{n} \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_t^i; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(t; Z_T) \end{bmatrix}^{\otimes r}\!, \; \frac{1}{\sqrt{\Lambda}} \sum_{u=1}^{n} \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_u^j; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(u; Z_T) \end{bmatrix}^{\otimes r} \right\rangle. \qquad (13)$$

Using these derivations, we are now ready to formally define our Higher-order Kernel (HOK) descriptor as follows:

Definition 1 (HOK). Let $X = \langle x_1, x_2, \ldots, x_n \rangle$, $x_i \in [0,1]^d$, be the probability scores from $d$ classifiers for the $n$ frames in a sequence. Then, we define the $r$-th order HOK descriptor for $X$ as

$$\mathrm{HOK}(X) = \frac{1}{\sqrt{\Lambda}} \sum_{u=1}^{n} \begin{bmatrix} \sqrt{\zeta_1}\, \Phi_F(x_u; Z_F) \\ \sqrt{\zeta_2}\, \Phi_T(u; Z_T) \end{bmatrix}^{\otimes r} \qquad (14)$$

for pivot sets $Z_F \subset \Delta^d$ and $Z_T \subset \mathbb{R}$ for the classifier scores and the temporal instances, respectively. Further, $\zeta_1, \zeta_2 \in [0,1]$ such that $\zeta_1 + \zeta_2 = 1$, and $\Lambda$ is a suitable normalization.

Once the HOK tensor is generated for a sequence, we vectorize it for use in a linear classifier trained against action labels. As can be verified (see [19]), the HOK tensor is super-symmetric; thus, after removing the repeated symmetric entries, the dimensionality of this descriptor is $\binom{|Z_F| + |Z_T| + r - 1}{r}$. In the sequel, we use $r = 3$ as a trade-off between performance and descriptor size. Figure 2 illustrates our overall HOK generation and classification framework.

4.4. Power Normalization

It is often found that applying a non-linear operator to higher-order tensors leads to significantly better performance [20]. For example, for BOW, unit normalization is known to reduce the impact of background features, while taking the feature square root reduces burstiness. Motivated by these observations, we incorporate such non-linearities into the HOK descriptors as well. As these are higher-order tensors, we apply the following scheme, based on the higher-order SVD (HOSVD) decomposition of the tensor [19, 17, 18]. Let $\mathcal{H}$ denote $\mathrm{HOK}(X)$; then

$$[\Sigma, U_1, U_2, \ldots, U_r] = \mathrm{HOSVD}(\mathcal{H}), \qquad (15)$$

$$\hat{\mathcal{H}} = \mathrm{sign}(\Sigma)\, |\Sigma|^{\alpha} \times_{i=1}^{r} U_i, \qquad (16)$$

where the $U_i$ are the orthonormal matrices (which are all the same in our case) associated with the Tucker decomposition [23], and $\Sigma$ is the core tensor. Note that, unlike the usual SVD operation for matrices, the core tensor in HOSVD is generally not diagonal. Refer to the notation in Section 3 for the definition of $\times_{i=1}^{n}$. We use $\hat{\mathcal{H}}$ in training the linear classifiers after vectorization. The power-normalization parameter $\alpha$ is selected via cross-validation.

4.5. Computational Complexity

For $M$ classes, $n$ frames per sequence, $|Z|$ pivots, and tensor order $r$, generating HOK takes $O(n(M + |Z|^r))$. As HOK is super-symmetric, using truncated SVD for a rank $k \ll |Z|$, HOSVD takes only $O(k^2 |Z|)$ time. See [18] for more details.
4.6. Learning HOK Parameters

An important step in using the HOK descriptor is to find appropriate pivot sets $Z_F$ and $Z_T$. Given that the temporal pivots are uni-dimensional, we select them to be equally spaced along the time axis after normalizing the temporal indexes to $[0, 1]$. For $Z_F$, which operates on the classifier scores that can be high-dimensional, we propose to use an expectation-maximization algorithm. This choice is motivated by the fact that the entries for $\zeta_1 \Phi_F(x_u; Z_F)$ in (12) essentially compute a soft similarity between the classifier score vector of every frame and the pivots through a Gaussian kernel. Thus, modeling the problem in a soft-assignment setup using a Gaussian mixture model is natural; the parameters (the mean and the variance) are learned using the EM algorithm, and these parameters are used as the pivot set. Other parameters in the model, such as $\zeta$, are computed using cross-validation. The normalization factor $\Lambda$ is chosen to be $n^2$, where $n$ is the sequence length.
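A minimal sketch of this pivot-learning step, using scikit-learn's EM-based GaussianMixture in place of a hand-rolled EM implementation; the score matrix below is a random stand-in for the pooled frame-level classifier scores:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for frame-level classifier score vectors pooled
# over the training set.
scores = np.random.default_rng(0).random((5000, 21))

gmm = GaussianMixture(n_components=48, covariance_type='diag',
                      random_state=0).fit(scores)
Z_F = gmm.means_                  # pivot set: component means
s_F = np.sqrt(gmm.covariances_)   # per-dimension standard deviations;
                                  # a scalar bandwidth per pivot can be
                                  # derived from these, e.g., by averaging

# Temporal pivots: equally spaced on [0, 1] after index normalization.
Z_T = np.linspace(0.0, 1.0, 5)
```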
5. Experiments and Results

This section provides experimental evidence of the usefulness of our proposed pooling scheme for fine-grained action recognition. We verify this on two popular benchmark datasets for this task, namely, (i) the MPII cooking activities dataset [28], and (ii) the JHMDB dataset [13]. Note that we use a VGG-16 model [5] for the two-stream architecture for both datasets, which is pre-trained on ImageNet for object recognition and fine-tuned on the respective action datasets.

5.1. Datasets

MPII Cooking Activities Dataset [28]: consists of high-resolution videos of cooking activities recorded by a stationary camera. The dataset consists of videos of people cooking various dishes; each video contains a single person cooking a dish, and overall there are 12 such videos in the dataset. There are 64 distinct activities spread across 3748 video clips and one background activity (1861 clips). The activities range from coarse subject motions such as moving from X to Y, opening refrigerator, etc., to fine-grained actions such as peel, slice, cut apart, etc.

JHMDB Dataset [13]: is a subset of the larger HMDB dataset [22], but contains videos in which the human limbs are clearly visible. The dataset contains 21 action categories such as brush hair, pick, pour, push, etc. Unlike the MPII cooking activities dataset, the videos in the JHMDB dataset are of low resolution. Each video clip is a few seconds long. There are a total of 968 videos, which are mostly downloaded from the internet.

5.2. Evaluation Protocols

We follow the standard protocols suggested in the original publications that introduced these datasets. Thus, we use the mean average precision (mAP) over 7-fold cross-validation for the MPII dataset, while we use the mean average accuracy over 3-fold cross-validation for the JHMDB dataset. For the former, we use the evaluation code published with the dataset.

5.3. Preprocessing

As the original MPII cooking videos are of very high resolution, while the activities happen only in certain parts of the scene, we used a frame-differencing scheme to estimate a window of the scene that localizes the action. Precisely, for every sequence, we first resize the frames to half their sizes, followed by frame differencing, dilation, smoothing, and connected component analysis. This results in a binary image for every frame; these images are then combined across the sequence, and a binary mask is generated for the entire sequence. We use the largest bounding box containing all the connected components in this binary mask as the region of the action, and crop the video to this box. The cropped frames are then resized to 224×224 and used to train the VGG networks. To compute optical flow, we used the implementation of [3]. Each flow image is rescaled to 0–255 and saved as a JPEG image for storage efficiency, as described in [31]. For the JHMDB dataset, the frames are already of low resolution; thus, we use them directly in the CNN after resizing to the expected input sizes.
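This localization pipeline can be sketched with OpenCV as below; the threshold value and kernel sizes are our guesses, as the paper does not specify them.

```python
import cv2
import numpy as np

def action_box(frames):
    """Rough sketch of the mask-based localization in Section 5.3:
    frame differencing, dilation, smoothing, then one bounding box
    around all connected components of the combined binary mask."""
    small = [cv2.resize(f, None, fx=0.5, fy=0.5) for f in frames]
    mask = np.zeros(small[0].shape[:2], np.uint8)
    kernel = np.ones((5, 5), np.uint8)
    for prev, cur in zip(small, small[1:]):
        diff = cv2.absdiff(cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY),
                           cv2.cvtColor(cur, cv2.COLOR_BGR2GRAY))
        diff = cv2.GaussianBlur(cv2.dilate(diff, kernel), (5, 5), 0)
        _, binary = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        mask |= binary                  # combine masks over the sequence
    pts = cv2.findNonZero(mask)         # assumes some motion was detected
    return cv2.boundingRect(pts)        # (x, y, w, h) at half resolution
```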
5.4. CNN Training

The two streams of the CNN are trained separately on the respective input modalities against a softmax cross-entropy loss. We used the sequences from the training set of the MPII cooking activities dataset for training the CNNs (1992 sequences) and used those from the provided validation set (615 sequences) to check for overfitting. For JHMDB, we used 50% of the training set for fine-tuning the CNNs, of which 10% is used as the validation set. We augmented the datasets using random crops, flips, and slight rotations of the frames. While fine-tuning the CNNs (from a pre-trained ImageNet model), we used a fixed learning rate of 10^-4 and an input batch size of 50 frames. The CNN training was stopped as soon as the loss on the validation set started increasing, which happened in about 6K iterations for the appearance stream and 40K iterations for the flow stream.
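A contemporary PyTorch rendering of this fine-tuning setup might look as follows. The paper specifies only the learning rate, batch size, and stopping criterion; the framework, solver, and data loading below are illustrative choices of ours (64 classes corresponds to MPII).

```python
import torch
import torchvision

# VGG-16 pre-trained on ImageNet; replace the fc8 layer with one
# matching the number of action classes.
model = torchvision.models.vgg16(weights='IMAGENET1K_V1')
model.classifier[6] = torch.nn.Linear(4096, 64)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # fixed LR
criterion = torch.nn.CrossEntropyLoss()                   # softmax CE

def train_step(frames, labels):
    """One update on a batch of 50 frames, shape (50, 3, 224, 224).
    Training stops once the validation loss starts increasing."""
    optimizer.zero_grad()
    loss = criterion(model(frames), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```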
5.5. HOK Parameter Learning

As is clear from Definition 1, there are a few hyper-parameters associated with the HOK descriptor. In this section, we systematically analyze the influence of these parameters on the overall classification performance of the descriptor. To this end, we use split 1 of the JHMDB dataset. Specifically, we explore the effect of changes to (i) the number of classifier pivots $Z_F$, (ii) the number of temporal pivots $Z_T$, and (iii) the power-normalization factor $\alpha$. In Figure 3, we plot the classifier accuracy against each of these cases. Each experiment is repeated 5 times with different initializations (for the GMM), and the average accuracy is used for the plots.
[Figure 3: Analysis of the influence of various hyper-parameters on the action recognition accuracy; panels (a)-(c) correspond to studies (i)-(iii) below. The numbers are computed on split 1 of the JHMDB dataset, which consists of 21 ground-truth action classes.]

For (i), we fixed the number of temporal pivots to 5, with values [0, 0.25, 0.5, 0.75, 1], and fixed $\sigma_T = 0.1$. The classifier pivots $Z_F$ and their standard deviations $\sigma_F$ are found by learning GMM models with the prescribed number of Gaussian components. The mean and the diagonal variance from this learned model are then used as the pivot set and its variance, respectively. As is clear from Figure 3(a), as the number of classifier pivots increases, the accuracy increases as well. However, beyond a certain number, the accuracy starts dropping. This is perhaps due to the sequences not containing a sufficient number of frames to account for larger models. Note that the JHMDB sequences contain about 30–40 frames per sequence. We also note that the accuracy of Flow+RGB is significantly higher than either stream alone.

For (ii), we fix the number of classifier pivots at 48 (the best we found from Figure 3(b)) and varied the number of temporal pivots from 1 to 30 in steps of 5. Similar to the classifier pivots, we find that increasing the number of temporal pivots is beneficial. Further, a larger $\sigma_T$ leads to a drop in accuracy, which implies that the ordering of the probabilistic scores does play a role in the recognition of the activity.

For (iii), we fixed the number of classifier pivots at 48 and the number of temporal pivots to 5 (as described for (i) above). We varied $\alpha$ from 0.1 to 1 in steps of 0.1. We find that $\alpha$ closer to 0 is more promising, implying that there is significant influence of burstiness in the sequences. That is, attenuating the larger probabilistic co-occurrences more than the weaker co-occurrences in the tensor leads to better performance.

Action | Avg. Pool mAP (%) | HOK Pool mAP (%)
Change Temperature | 15.1 | 57.5
Dry | 27.7 | 50.2
Fill water from tap | 10.5 | 40.6
Open/close drawer | 25.2 | 65.1

Table 1: An analysis of per-class action recognition accuracy when using average pooling and HOK pooling (the top classes corrected by HOK pooling).

5.6. Results

In this section, we provide full experimental comparisons for the two datasets. Our main goal is to analyze the usefulness of higher-order pooling for action recognition. To this end, in Table 2, we show the performance differences between using (i) the first-order statistics, (ii) the second-order statistics, and (iii) our proposed third-order statistics. For (i), we average the classifier soft-max scores, as is usually done in late pooling [31]. For (ii), we use the second-order kernel matrix without pivoting. Specifically, for every sequence, let $x_{i:}$ and $x_{j:}$ denote the temporal evolutions of the probabilistic scores for classifiers $i$ and $j$, respectively. Then, we compute a kernel matrix $K(x_{i:}, x_{j:}) = e^{-\sigma \|x_{i:} - x_{j:}\|_2^2}$. As this matrix is a positive definite object, we use its log-Euclidean map (that is, the matrix logarithm, which is the asymptotic limit of $\alpha \to 0$ in power normalization) for embedding it in the Euclidean space [10]. This vector is then used for training. As is clear, this matrix captures the second-order statistics of the actions.
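A small sketch of this second-order baseline, using SciPy's matrix logarithm for the log-Euclidean map; the small diagonal jitter is our addition to keep the kernel matrix numerically positive definite:

```python
import numpy as np
from scipy.linalg import logm

def second_order_feature(P, sigma=0.1, eps=1e-6):
    """RBF kernel between the temporal score evolutions of every
    classifier pair, followed by the log-Euclidean map."""
    # P: (n, M) frame scores; columns are per-classifier time series.
    sq = ((P[:, :, None] - P[:, None, :]) ** 2).sum(axis=0)  # (M, M)
    K = np.exp(-sigma * sq)
    K += eps * np.eye(K.shape[0])
    L = logm(K).real
    iu = np.triu_indices(K.shape[0])   # symmetric: keep upper triangle
    return L[iu]                       # vector used to train the classifier

P = np.random.default_rng(0).random((30, 21))
print(second_order_feature(P).shape)   # (231,)
```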
And for (iii), we use the proposed HOK descriptor as described in Definition 1. As is clear from Table 2, higher-order statistics lead to significant benefits on both datasets and for both input modalities (flow and RGB) and their combinations.

5.7. Comparisons to the State of the Art

In Tables 3 and 4, we compare HOK descriptors to the state-of-the-art results on these two datasets. In this case, we combine the HOK descriptors from both the RGB and flow streams of the CNN. For the MPII dataset, we use 32 pivots for the classifier scores and 5 equispaced pivots for the time steps, with $\sigma_T = 0.1$. For the second-order tensor, we use $\sigma = 0.1$ for both datasets. We use the same setup for the JHMDB dataset, except that we use 48 pivots.
The power normalization factor is set to 0.1. As is clear, although HOK by itself is not superior to other methods, when the second- and third-order statistics are combined (stacking their values into a vector), it demonstrates significant promise. For example, we see an improvement of 5–6% over the recent method in [6], which also uses a CNN. Further, we find that when the higher-order statistics are combined with trajectory features, there is a further improvement in accuracy, which results in a model that outperforms the state of the art.

Experiment | MPII mAP (%) | JHMDB Mean Avg. Acc. (%)
RGB (avg. pool) | 33.9 | 51.5
Flow (avg. pool) | 37.6 | 54.8
RGB+Flow (avg. pool) | 38.1 | 55.9
RGB (second-order) | 56.1 | 52.3
Flow (second-order) | 61.3 | 60.4
RGB+Flow (second-order) | 67.0 | 63.4
RGB (HOK) | 47.8 | 52.3
Flow (HOK) | 55.4 | 58.2
RGB+Flow (HOK) | 60.6 | 64.7

Table 2: Evaluation of the HOK descriptor on the output of each CNN stream and their fusion on the MPII (7 splits) and JHMDB (3 splits) datasets. We also show the accuracy obtained via second-order pooling, which uses the kernel matrix directly without linearization (see text for details).

Algorithm | mAP (%)
Holistic + Pose, CVPR'12 | 57.9
Video Darwin, CVPR'15 | 72.0
Interaction Part Mining, CVPR'15 | 72.4
P-CNN, ICCV'15 | 62.3
P-CNN + IDT-FV, ICCV'15 | 71.4
Semantic Features, CVPR'15 | 70.5
Hierarchical Mid-Level Actions, ICCV'15 | 66.8
HOK (ours) | 60.6
HOK + second-order (ours) | 69.1
HOK + second-order + trajectories (ours) | 73.1

Table 3: MPII Cooking Activities dataset (7 splits).

Algorithm | Avg. Acc. (%)
P-CNN, ICCV'15 | 61.1
P-CNN + IDT-FV, ICCV'15 | 72.2
Action Tubes, CVPR'15 | 62.5
Stacked Fisher Vectors, ECCV'14 | 69.03
IDT + FV, ICCV'13 | 62.8
HOK (ours) | 64.7
HOK + second-order (ours) | 66.8
HOK + second-order + IDT-FV (ours) | 73.3

Table 4: JHMDB dataset (3 splits).

5.8. Analysis of Results

To gain insights into the performance benefits noted above, we conducted an analysis of the results on the MPII dataset. Table 1 lists the activities that are initially confused under average pooling but corrected by HOK. Specifically, we find that activities such as Fill water from tap and Open/close drawer, which are originally confused with Wash objects and Take out from drawer, get corrected using higher-order pooling. Note that these activities are inherently ambiguous unless context and sub-actions are analyzed. This shows that our descriptor can effectively represent useful cues for recognition.

In Table 2 (column 1), we see that the second-order tensor performs significantly better than HOK for the MPII dataset. We suspect this surprising behavior is due to the highly unbalanced number of frames per sequence in this dataset. For example, for classes such as pull-out, pour, etc., that have only about 7 clips, each of 50–90 frames, the second-order descriptor is about 30% better than HOK in mAP, while for classes such as take spice holder, having more than 25 videos with 50–150 frames, HOK is about 10% better than second-order. This suggests that the poor performance is perhaps due to unreliable estimation of the data statistics, and that second- and third-order pooling provide complementary cues, as also witnessed in Table 3. For the JHMDB dataset, there are about 30 frames in all sequences, and thus the statistics are more consistent. Another reason could be that unbalanced sequences may bias the GMM parameters, which form the pivots, towards classes that have more frames.
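For completeness, the descriptor combinations reported in Tables 3 and 4 amount to stacking the vectorized descriptors before training the linear classifier; a schematic with random stand-in features and illustrative sizes:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_videos = 100
hok_vecs = rng.random((n_videos, 455))       # vectorized HOK descriptors
second_order = rng.random((n_videos, 231))   # log-Euclidean vectors
labels = rng.integers(0, 21, n_videos)

# "Stacking their values into a vector" (Section 5.7), then a linear SVM.
combined = np.hstack([hok_vecs, second_order])
clf = LinearSVC().fit(combined, labels)
print(clf.score(combined, labels))
```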
6. Conclusion

In this paper, we presented a technique for higher-order pooling of CNN scores for the task of action recognition in videos. We showed how to use the idea of kernel linearization to generate a higher-order kernel descriptor, which can capture latent relationships between the CNN classifier scores. Our experimental analysis on two standard fine-grained action datasets clearly demonstrates that using higher-order relationships is beneficial for the task and leads to state-of-the-art performance.

Acknowledgements: This research was supported by the Australian Research Council (ARC) through the Centre of Excellence for Robotic Vision (CE140100016).

References

[1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39, 2011.
[2] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In NIPS, 2009.
[3] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. PAMI, 33(3):500–513, 2011.
[4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[6] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In ICCV, 2015.
[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[8] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
[9] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. PAMI, 33(11):2188–2202, 2011.
[10] K. Guo, P. Ishwar, and J. Konrad. Action recognition from video using feature covariance matrices. TIP, 2013.
[11] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
[12] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. JMLR, 5:819–844, 2004.
[13] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
[14] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
[15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[16] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[17] P. Koniusz and A. Cherian. Sparse coding for third-order super-symmetric tensor descriptors with application to texture recognition. In CVPR, 2016.
[18] P. Koniusz, A. Cherian, and F. Porikli. Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV, 2016.
[19] P. Koniusz, F. Yan, P. Gosselin, and K. Mikolajczyk. Higher-order occurrence pooling for bags-of-words: Visual concept detection. PAMI, 2016.
[20] P. Koniusz, F. Yan, and K. Mikolajczyk. Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection. CVIU, 2012.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[23] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Analysis and Applications, 21:1253–1278, 2000.
[24] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In NIPS, 2014.
[25] B. Ni, V. R. Paramathayalan, and P. Moulin. Multiple granularity analysis for fine-grained action detection. In CVPR, 2014.
[26] L. Pishchulin, M. Andriluka, and B. Schiele. Fine-grained activity recognition with holistic and pose based features. In Pattern Recognition, pages 678–689. Springer, 2014.
[27] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[28] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In CVPR, 2012.
[29] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele. Recognizing fine-grained and composite activities using hand-centric features and script data. IJCV, 119(3):346–373, 2016.
[30] E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR, 2005.
[31] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[32] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[33] M. A. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In ECCV, 2002.
[34] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In CVPR, 2013.
[35] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
[36] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[37] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[38] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[39] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[40] Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian. Interaction part mining: A mid-level approach for fine-grained action recognition. In CVPR, 2015.
[41] S. Zuffi and M. J. Black. Puppet flow. IJCV, 101(3):437–458, 2013.