FEATURE SAMPLING STRATEGIES FOR ACTION RECOGNITION

Youjie Zhou, Hongkai Yu and Song Wang

Department of Computer Science and Engineering, University of South Carolina
zhou42@email.sc.edu, yu55@email.sc.edu, songwang@cec.sc.edu
ABSTRACT
Although dense local spatial-temporal features with bag-of-features representation achieve state-of-the-art performance for action recognition, the huge feature number and feature size prevent current methods from scaling up to real-size problems. In this work, we investigate different types of feature sampling strategies for action recognition, namely dense sampling, uniformly random sampling and selective sampling. We propose two effective selective sampling methods using object proposal techniques. Experiments conducted on a large video dataset show that we are able to achieve better average recognition accuracy using 25% fewer features, through one of the proposed selective sampling methods, and even maintain comparable accuracy while discarding 70% of the features.

Index Terms— Action recognition, Video analysis, Feature sampling

Fig. 1. Different feature sampling methods for action recognition. (Panels, left to right: video frame, dense sampling, random sampling, selective sampling.)

1. INTRODUCTION

Given the popularity of social media, it has become much easier to collect large numbers of videos from the Internet for human action recognition. Effective video representation is required for recognizing human actions and understanding video content in such rapidly growing unstructured data.

By far the most commonly used video representation for action recognition has been the bag-of-words (BoW) model [1]. The basic idea is to summarize/encode the local spatial-temporal features in a video as a single vector. Among local features, dense trajectories (DT) [2] and their improved variant (iDT) [3] provide state-of-the-art results on most action datasets [3]. The main idea is to construct trajectories by tracking densely sampled feature points across frames, and to compute multiple descriptors along the trajectories.

Despite their success, DT and iDT can produce a huge number of local features, e.g., for a low-resolution 320×204 video with 175 frames, they can generate ∼52 MB of features [4]. It is difficult to store and manipulate such dense features for large datasets with thousands of high-resolution videos, especially for real-time applications.

Existing work focuses on reducing the total number of trajectory features through uniformly random sampling, at the cost of a minor reduction in recognition accuracy. [5] proposed a part model by which features can be randomly sampled at lower image scales in an efficient way. [6] interpolated trajectories using uniformly distributed nearby feature points. [4] investigated the influence of random sampling on recognition accuracy on several large-scale datasets. However, intuitively, features extracted around informative regions, such as the human arms in hand waving, should be more useful for action classification than features extracted on the background. [7, 8] proposed selective sampling strategies for dense trajectory features based on saliency maps, produced by modeling human eye movement when viewing videos. They are able to achieve better recognition results with selectively sampled features. However, it is impractical to obtain eye movement data for large datasets.

In this work, we investigate several feature sampling strategies for action recognition, as illustrated in Fig. 1, and propose two data-driven selective feature sampling methods. Inspired by the success of applying object proposal techniques to efficient saliency detection [9], we construct saliency maps using a recent object proposal method, EdgeBox [10, 11], and selectively sample dense trajectory features for action recognition. We further extend EdgeBox to produce proposals and construct saliency maps for objects with motion of interest. More effective features can then be sampled for action classification. We evaluated several feature sampling methods on a publicly available dataset, and show that the proposed motion object proposal based selective sampling method is able to achieve better accuracy using 25% fewer features than the full feature set.

The remainder of this paper is organized as follows: first we give a brief introduction to the DT/iDT features and the other components of our action classification framework; then three different feature sampling methods are described; finally, we discuss experimental results on a large video dataset.
Fig. 2. Illustration of the selective sampling methods via object proposal algorithms. From left to right: the original video frame, the dense optical flow field, estimated object boundaries, the top 5 scoring boxes generated by EdgeBox, the saliency map constructed using EdgeBox proposals, estimated motion boundaries, the top 5 scoring boxes generated by FusionEdgeBox, and the saliency map constructed using FusionEdgeBox. (Rows show example actions: brush hair, catch, jump, pick, swing baseball.)
2. DENSE TRAJECTORY FEATURES

The DT algorithm [2] represents a video by dense trajectories, together with appearance and motion features extracted around the trajectories. On each video frame, feature points are densely sampled using a grid with a spacing of 5 pixels, at 8 spatial scales spaced by a factor of 1/√2, as illustrated in the second column of Fig. 1. Trajectories are then constructed by tracking the feature points through the video based on dense optical flow [12]. The default length of a trajectory is 15, i.e., feature points are tracked over 15 consecutive frames. The iDT algorithm [3] further enhances the trajectory construction by eliminating background motions caused by camera movement.
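As a concrete illustration, the following is a minimal sketch of the dense sampling and tracking steps, assuming OpenCV's implementation of Farnebäck optical flow [12]. It handles a single spatial scale; the helper names and all parameter values beyond the 5-pixel grid spacing are our own choices, not the DT reference code.

```python
import cv2
import numpy as np

def dense_grid(height, width, step=5):
    """Feature points on a regular grid with 5-pixel spacing (one scale)."""
    ys, xs = np.mgrid[step // 2:height:step, step // 2:width:step]
    return np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)

def track_points(prev_gray, next_gray, points):
    """Advance each point by the dense optical flow sampled at its location."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    xi = np.clip(points[:, 0].astype(int), 0, flow.shape[1] - 1)
    yi = np.clip(points[:, 1].astype(int), 0, flow.shape[0] - 1)
    return points + flow[yi, xi]  # each trajectory grows by one point per frame
```

Repeating track_points over 15 consecutive frames yields one trajectory per surviving grid point; the original algorithm additionally median-filters the flow and discards points in homogeneous image regions.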
For each trajectory, 5 types of descriptors are extracted: 1) the shape of the trajectory encodes local motion patterns and is described by a sequence of displacement vectors in both the x- and y-directions; 2) HOG, histograms of oriented gradients [13], captures appearance information and is computed in a 32×32×15 spatio-temporal volume surrounding the trajectory; 3) HOF, histograms of optical flow [14], focuses on local motion information and is computed in the same spatio-temporal volume as HOG; 4+5) MBHx and MBHy, motion boundary histograms [14], are computed separately for the horizontal and vertical gradients of the optical flow. HOG, HOF and MBH are all normalized appropriately.
To encode the descriptors/features, we use the Fisher vector [15] as in [3]. For each feature type, we first reduce its dimensionality by a factor of two using Principal Component Analysis (PCA). Then a codebook of size 256 is formed by the Gaussian Mixture Model (GMM) algorithm on a random selection of 256,000 features from the training set. To combine different types of features, we simply concatenate their l2-normalized Fisher vectors.

For classification, we apply a linear SVM provided by LIBSVM [16], and the one-versus-rest approach is used for multi-class classification. In all experiments, we fix C = 100 in the SVM, as suggested in [3].
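As a sketch of this encoding stage, the code below fits the PCA/GMM codebook and computes an l2-normalized Fisher vector, assuming scikit-learn; the helper names (fit_codebook, fisher_vector) and the power-normalization step from [15] are our additions, not a transcript of the pipeline used in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_codebook(train_descriptors, n_components=256, sample_size=256000):
    """PCA halving the dimensionality, then a 256-component GMM codebook."""
    idx = np.random.choice(len(train_descriptors), sample_size, replace=False)
    sample = train_descriptors[idx]
    pca = PCA(n_components=sample.shape[1] // 2).fit(sample)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag').fit(pca.transform(sample))
    return pca, gmm

def fisher_vector(descriptors, pca, gmm):
    """Encode one video's descriptors of one type as an l2-normalized FV."""
    x = pca.transform(descriptors)              # N x D
    gamma = gmm.predict_proba(x)                # N x K soft assignments
    n = x.shape[0]
    parts = []
    for k in range(gmm.n_components):
        diff = (x - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        g = gamma[:, k:k + 1]
        parts.append((g * diff).sum(0) / (n * np.sqrt(gmm.weights_[k])))
        parts.append((g * (diff ** 2 - 1)).sum(0)
                     / (n * np.sqrt(2 * gmm.weights_[k])))
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))      # power normalization [15]
    return fv / (np.linalg.norm(fv) + 1e-12)    # l2 normalization
```

The per-type Fisher vectors of the five descriptors are then concatenated and fed to the linear SVM (C = 100, one-versus-rest).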
3. FEATURE SAMPLING STRATEGIES

In the following, we describe three feature sampling methods that differ from using all trajectories and related features computed on dense grids, as in the DT/iDT algorithms. All three methods derive a sampling probability for each trajectory feature, denoted by σ, which measures whether the feature will be sampled or not. For example, σ = 0.8 means we sample trajectory features whose sampling probability is greater than or equal to 0.8 for action recognition.

3.1. Uniformly Random Sampling

Following previous work [5, 4], we simply sample dense trajectory features in a random and uniform way; the sampling probability σ is the same for every trajectory. In experiments, we randomly sample 80%, 60%, 40% and 30% of the trajectory features, and report the resulting action recognition accuracies respectively.
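A minimal sketch of this baseline, with NumPy and our own helper name; each trajectory feature is kept independently at the chosen rate:

```python
import numpy as np

def random_sample(features, keep_rate, seed=0):
    """features: (num_trajectories, dim) array; keep_rate=0.8 keeps ~80%."""
    rng = np.random.default_rng(seed)
    return features[rng.random(len(features)) < keep_rate]
```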
3.2. Selective Sampling via Object Proposal

EdgeBox [10] is one of the most efficient object proposal algorithms [11] published recently. We utilize it to construct a saliency map on each video frame, and sample trajectory features with respect to the computed saliency values.

In EdgeBox, given a video frame, object boundaries are estimated via structured decision forests [17], and object contours are formed by grouping detected boundaries with similar orientations. In order to determine how likely a bounding box is to contain objects of interest, a simple but effective objectiveness score s_obj was proposed, based on the number of contours that are wholly enclosed by the box. We allow at most 10,000 boxes of different sizes and aspect ratios to be examined per frame. Fig. 2 illustrates the estimated object boundaries and the top 5 scoring boxes generated by EdgeBox in the third and fourth columns respectively.

Given thousands of object proposal boxes on a video frame, we construct a saliency map through a pixel voting procedure. Each object proposal box is considered as a vote for all pixels located inside it. We normalize all pixel votes into [0, 1] to form a saliency probability distribution. Saliency map examples are illustrated in the fifth column of Fig. 2, where warmer colors indicate higher saliency probabilities.

Based on the constructed saliency maps of a video, we are able to selectively sample trajectories and related features. If the saliency probability at the starting pixel of a trajectory is higher than a predefined sampling probability σ, the trajectory and its related features are sampled. In experiments, we report recognition accuracies for σ of 0.2, 0.4 and 0.6 respectively.
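The voting and sampling steps admit a compact sketch; box generation is left to an EdgeBox implementation, boxes are assumed to be integer pixel coordinates, and the trajectory attributes (points, start_frame) are hypothetical names for illustration:

```python
import numpy as np

def saliency_from_boxes(boxes, height, width):
    """Each box (x1, y1, x2, y2) votes for its pixels; votes scaled to [0, 1]."""
    votes = np.zeros((height, width))
    for x1, y1, x2, y2 in boxes:
        votes[y1:y2 + 1, x1:x2 + 1] += 1.0
    peak = votes.max()
    return votes / peak if peak > 0 else votes

def selective_sample(trajectories, saliency_maps, sigma):
    """Keep a trajectory if saliency at its starting pixel exceeds sigma."""
    kept = []
    for t in trajectories:
        x, y = t.points[0]
        if saliency_maps[t.start_frame][int(y), int(x)] > sigma:
            kept.append(t)
    return kept
```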
3.3. Selective Sampling via Motion Object Proposal

Although the stacked boxes generated via EdgeBox are able to highlight regions of a frame containing salient objects, the constructed saliency map may not be suitable for sampling features for action recognition. For example, in the last row of Fig. 2, the optical flow field (second column) clearly indicates that the region with motion of interest for action recognition is located around the actor's head and arms, while the top scoring boxes and the saliency map constructed via EdgeBox incorrectly focus on the actor's legs. Thus, in order to incorporate motion information, we propose a motion object proposal method, named FusionEdgeBox, in which a fused objectiveness score is measured on both object boundaries and motion boundaries. The fusion score function is defined as

    s_fusion = α · s_obj + β · s_motion,    (1)

where s_obj is the original EdgeBox score, s_motion is the proposed motion objectiveness score, and α and β are balance parameters. We empirically fix α = β = 1 for all experiments. s_motion is defined similarly to s_obj, i.e., based on the number of wholly enclosed contours in a box. However, s_motion utilizes contours that are grouped from motion boundaries, which are estimated as image gradients of the optical flow field. Motion boundary examples are shown in the sixth column of Fig. 2.
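As a hedged sketch of this scoring idea: motion boundaries are taken as gradients of the flow field, and Eq. (1) fuses the two objectiveness terms. Here edgebox_score and motion_score stand in for the contour grouping and enclosed-contour counting of the actual EdgeBox pipeline, which we do not reimplement:

```python
import numpy as np

def motion_boundary_magnitude(flow):
    """flow: (h, w, 2) optical flow; gradient magnitude over both components."""
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    return np.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)

def fusion_score(box, edgebox_score, motion_score, alpha=1.0, beta=1.0):
    """Eq. (1): s_fusion = alpha * s_obj + beta * s_motion, alpha = beta = 1."""
    return alpha * edgebox_score(box) + beta * motion_score(box)
```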
By applying the fusion score within the EdgeBox framework, we are able to generate a set of proposal boxes and construct the saliency map for feature sampling as before. Examples of the top 5 scoring fusion boxes and the constructed saliency maps are illustrated in the last two columns of Fig. 2 respectively. Comparing with the examples generated by the original EdgeBox (shown in columns 3-5), we can see that FusionEdgeBox is able to better explore regions with motion of interest, which is useful for action feature sampling (verified by the experiments below).

Similarly, we report recognition accuracies using sampled trajectory features for σ of 0.2, 0.4 and 0.6 respectively.
4. EXPERIMENTS

We have conducted experiments on a publicly available video dataset, namely J-HMDB [18], which consists of 920 videos of 21 different actions. These videos are selected from the larger HMDB dataset [19]. J-HMDB also provides annotated bounding boxes for the actors on each frame. We report the average classification accuracy over the three training/testing split settings provided by J-HMDB.

In the following, we evaluate action recognition on J-HMDB using trajectory features sampled through the different methods, and discuss their performance. We also compare the obtained accuracies with a few state-of-the-art action recognition algorithms.

4.1. Influence of Sampling Strategies

In addition to the three feature sampling methods introduced above, to better understand trajectory features, we investigate a fourth sampling method using the annotated bounding boxes for actors: we sample trajectory features if the starting point of a trajectory is located inside an annotation box. A similar strategy was proposed in [18], and we name it GT.
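A sketch of this GT baseline, under the same hypothetical trajectory attributes as above:

```python
def gt_sample(trajectories, actor_boxes):
    """Keep a trajectory if its starting point lies inside the annotated
    actor box (x1, y1, x2, y2) of its starting frame."""
    kept = []
    for t in trajectories:
        x, y = t.points[0]
        x1, y1, x2, y2 = actor_boxes[t.start_frame]
        if x1 <= x <= x2 and y1 <= y <= y2:
            kept.append(t)
    return kept
```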
Figures 3 and 4 plot average classification accuracies over all classes for all sampling methods under different sampling rates, using the DT feature and the iDT feature respectively. In general, through feature sampling we are able to achieve higher performance than directly using all features, since noisy background features have been discarded.

Specifically, for the DT feature, we can see that: 1) trajectory features sampled inside annotated bounding boxes achieve higher accuracy than using all features; a similar phenomenon was observed in [18], which indicates that DT features located around the human body are more important than features extracted in other regions. 2) Selective sampling methods achieve higher accuracies than random sampling given a similar number of sampled features. This shows that sampling DT features from certain regions is important for action recognition, and that object proposal based strategies are able to detect these regions. 3) The proposed selective sampling via motion object proposal outperforms the other sampling methods, even the one based on annotated bounding boxes. This verifies that the proposed FusionEdgeBox method is useful for exploring regions of interest for action recognition.

For the iDT feature, however, the different sampling methods result in similar accuracies. Random sampling outperforms the others slightly, especially when the number of sampled features is small. The reason may be that, by eliminating background motion caused by camera movement, the iDT feature is more compact and meaningful than the DT feature, e.g., the average number of iDT features per video is much lower than that of DT features. Random sampling is able to better preserve the original iDT feature distribution than the selective samplings, which have a quite large sampling bias.
Fig. 3. Average accuracies using the DT feature. (Line plot: average accuracy vs. features per video, for FusionEdgeBox, EdgeBox, Random, GT, and DT.)

Fig. 4. Average accuracies using the iDT feature. (Line plot: average accuracy vs. features per video, for FusionEdgeBox, EdgeBox, Random, GT, and iDT.)

Table 1. Comparison to the state-of-the-arts in terms of average accuracy and feature size. *It leverages an advanced feature encoding technique, stacked Fisher vectors.

Method                              J-HMDB    Memory (GB)
Dense Trajectory [2]                62.88%    5.4
Improved Dense Trajectory [3]       64.52%    4.2
Peng et al. [20] w/ iDT             69.03%*   4.2
Gkioxari et al. [21]                62.5%     -
Discard 20%∼25% of features
DT    Random                        62.33%    4.3
DT    EdgeBox                       65.33%    4.5
DT    FusionEdgeBox                 65.91%    4.0
iDT   Random                        65.49%    3.4
iDT   EdgeBox                       65.32%    3.6
iDT   FusionEdgeBox                 65.11%    3.5
Discard 70%∼80% of features
DT    Random                        59.90%    1.1
DT    EdgeBox                       58.51%    1.4
DT    FusionEdgeBox                 60.71%    1.4
iDT   Random                        62.34%    1.3
iDT   EdgeBox                       58.85%    1.2
iDT   FusionEdgeBox                 60.87%    1.3
4.2. Comparisons to the state-of-the-arts

Table 1 compares the feature sampling methods at different sampling rates with the state-of-the-arts. The sampling methods achieve better average accuracies than several state-of-the-art methods using the same classification pipeline, with ∼20% fewer features. It is interesting to observe that, even when discarding more than 70% of the features, random sampling and the proposed selective sampling are still able to maintain comparable performance.
5. CONCLUSIONS

In this work, we focus on feature sampling strategies for action recognition in videos. Dense trajectory features are utilized to represent videos, and two types of sampling strategies are investigated, namely uniformly random sampling and selective sampling. We propose to use object proposal techniques to construct saliency maps for video frames, and to use them to guide the selective feature sampling process. We also propose a motion object proposal method that incorporates object motion information into the object proposal framework. Experiments conducted on a large video dataset indicate that sampling based methods are able to achieve better recognition accuracy using 25% fewer features through one of the proposed selective feature sampling methods, and even maintain comparable accuracy while discarding 70% of the features.
6. REFERENCES

[1] Xiaojiang Peng, Limin Wang, Xingxing Wang, and Yu Qiao, "Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice," CoRR, vol. abs/1405.4506, 2014.

[2] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu, "Action recognition by dense trajectories," in CVPR, 2011.

[3] Heng Wang and Cordelia Schmid, "Action recognition with improved trajectories," in ICCV, 2013.

[4] Michael Sapienza, Fabio Cuzzolin, and Philip H. S. Torr, "Feature sampling and partitioning for visual vocabulary generation on large action classification datasets," CoRR, vol. abs/1405.7545, 2014.

[5] Feng Shi, Emil Petriu, and Robert Laganiere, "Sampling strategies for real-time action recognition," in CVPR, 2013.

[6] Vadim Kantorov and Ivan Laptev, "Efficient feature extraction, encoding and classification for action recognition," in CVPR, 2014.

[7] Stefan Mathe and Cristian Sminchisescu, "Dynamic eye movement datasets and learnt saliency models for visual action recognition," in ECCV, 2012.

[8] Eleonora Vig, Michael Dorr, and David Cox, "Space-variant descriptor sampling for action recognition based on saliency and eye movements," in ECCV, 2012.

[9] Hongliang Li, Fanman Meng, and King Ngi Ngan, "Co-salient object detection from multiple images," IEEE Transactions on Multimedia, vol. 15, pp. 1896–1909, 2013.

[10] C. Lawrence Zitnick and Piotr Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014.

[11] Jan Hosang, Rodrigo Benenson, and Bernt Schiele, "How good are detection proposals, really?," in BMVC, 2014.

[12] Gunnar Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Scandinavian Conference on Image Analysis, 2003.

[13] Navneet Dalal and Bill Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.

[14] Navneet Dalal, Bill Triggs, and Cordelia Schmid, "Human detection using oriented histograms of flow and appearance," in ECCV, 2006.

[15] Florent Perronnin, Jorge Sánchez, and Thomas Mensink, "Improving the fisher kernel for large-scale image classification," in ECCV, 2010.

[16] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27, 2011.

[17] Piotr Dollár and C. Lawrence Zitnick, "Structured forests for fast edge detection," in ICCV, 2013.

[18] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black, "Towards understanding action recognition," in ICCV, 2013.

[19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in ICCV, 2011.

[20] Xiaojiang Peng, Changqing Zou, Yu Qiao, and Qiang Peng, "Action recognition with stacked fisher vectors," in ECCV, 2014.

[21] Georgia Gkioxari and Jitendra Malik, "Finding action tubes," CoRR, vol. abs/1411.6031, 2014.