Article

Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation

Le Wang 1,*, Xuhuan Duan 1, Qilin Zhang 2, Zhenxing Niu 3, Gang Hua 4 and Nanning Zheng 1

1 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; [email protected] (X.D.); [email protected] (N.Z.)
2 HERE Technologies, Chicago, IL 60606, USA; [email protected]
3 Alibaba Group, Hangzhou 311121, China; [email protected]
4 Microsoft Research, Redmond, WA 98052, USA; [email protected]
* Correspondence: [email protected]; Tel.: +86-29-8266-8672

Received: 23 April 2018; Accepted: 16 May 2018; Published: 22 May 2018

Sensors 2018, 18, 1657; doi:10.3390/s18051657; www.mdpi.com/journal/sensors

Abstract: Inspired by the recent spatio-temporal action localization efforts with tubelets (sequences of bounding boxes), we present a new spatio-temporal action localization detector, Segment-tube, which consists of sequences of per-frame segmentation masks. The proposed Segment-tube detector can temporally pinpoint the starting/ending frame of each action category in the presence of preceding/subsequent interference actions in untrimmed videos. Simultaneously, the Segment-tube detector produces per-frame segmentation masks instead of bounding boxes, offering superior spatial accuracy to tubelets. This is achieved by alternating iterative optimization between temporal action localization and spatial action segmentation. Experimental results on three datasets validated the efficacy of the proposed method, including (1) temporal action localization on the THUMOS 2014 dataset; (2) spatial action segmentation on the SegTrack dataset; and (3) joint spatio-temporal action localization on the newly proposed ActSeg dataset. It is shown that our method compares favorably with existing state-of-the-art methods.

Keywords: action localization; action segmentation; 3D ConvNets; LSTM

1. Introduction

Joint spatio-temporal action localization has attracted significant attention in recent years [1–18]; its objectives include action classification (determining whether a specific action is present), temporal localization (pinpointing the starting/ending frame of the specific action) and spatio-temporal localization (typically bounding box regression on 2D frames, e.g., [6,12]). Such efforts include local feature based methods [1], convolutional neural network (ConvNet or CNN) based methods [2,14,15], 3D ConvNets based methods [4,11] and their variants [19–21]. Recently, long short-term memory (LSTM) based recurrent neural networks (RNNs) have been added on top of CNNs for action classification [5] and action localization [7].

Despite the successes of the prior methods, several limiting factors still impede practical applications. On the one hand, a large number of methods [2,3,5,13] conduct action recognition only on trimmed videos, where each video contains only one action without interference from other potentially confusing actions. On the other hand, many methods [1,7–11,15–17] emphasize only temporal action localization in untrimmed videos, without depicting the spatial location of the target action in each video frame.

Although there are several tubelet-style (i.e., outputting sequences of bounding boxes) spatio-temporal action localization efforts [6,12,22], they are restricted to trimmed videos only. For practical applications, untrimmed videos are much more prevalent, and sequences of bounding boxes might not offer enough spatial accuracy, especially for irregular shapes.
This motivated us to propose a practical spatio-temporal action localization method, which is capable of spatially and temporally localizing the target actions with per-frame segmentation in untrimmed videos.

With applications to untrimmed videos and improved spatial accuracy in mind, we propose the spatio-temporal action localization detector Segment-tube, which localizes target actions as sequences of per-frame segmentation masks instead of sequences of bounding boxes.

The proposed Segment-tube detector is illustrated in Figure 1. The sample input is an untrimmed video containing all frames of a pair figure skating video, with only a portion of these frames belonging to a relevant category (e.g., the Death Spirals). Initialized with saliency [23] based image segmentation on individual frames, our method first performs the temporal action localization step with cascaded 3D ConvNets [4] and LSTM, and pinpoints the starting frame and the ending frame of a target action with a coarse-to-fine strategy. Subsequently, the Segment-tube detector refines the per-frame spatial segmentation with graph cut [24] by focusing on the relevant frames identified by the temporal action localization step. The optimization alternates between temporal action localization and spatial action segmentation in an iterative manner. Upon practical convergence, the final spatio-temporal action localization results are obtained in the form of a sequence of per-frame segmentation masks (bottom row in Figure 1) with precise starting/ending frames. Intuitively, the temporal action localization and the spatial action segmentation naturally benefit each other.

Figure 1. Flowchart of the proposed spatio-temporal action localization detector Segment-tube. As the input, an untrimmed video contains multiple frames of actions (e.g., all actions in a pair figure skating video), with only a portion of these frames belonging to a relevant category (e.g., the Death Spirals). There are usually irrelevant preceding and subsequent actions (background). The Segment-tube detector alternates the optimization of temporal localization and spatial segmentation iteratively. The final output is a sequence of per-frame segmentation masks with precise starting/ending frames, denoted by the red chunk at the bottom, while the background is marked with green chunks at the bottom.

We conduct experimental evaluations (in both qualitative and quantitative measures) of the proposed Segment-tube detector and existing state-of-the-art methods on three benchmark datasets, including (1) temporal action localization on the THUMOS 2014 dataset [25]; (2) spatial action segmentation on the SegTrack dataset [26,27]; and (3) joint spatio-temporal action localization on the newly proposed ActSeg dataset, a spatio-temporal action localization dataset with per-frame ground truth segmentation masks that will be released on our project website. The experimental results show the performance advantage of the proposed Segment-tube detector and validate its efficacy in spatio-temporal action localization with per-frame segmentation.

In summary, the contributions of this paper are as follows:

• The spatio-temporal action localization detector Segment-tube is proposed for untrimmed videos, which produces not only the starting/ending frame of an action, but also per-frame segmentation masks instead of sequences of bounding boxes.
• The proposed Segment-tube detector achieves collaborative optimization of temporal localization and spatial segmentation with a new iterative alternation approach, where the temporal localization is achieved by a coarse-to-fine strategy based on cascaded 3D ConvNets [4] and LSTM.
• To properly evaluate the proposed Segment-tube and to build a benchmark for future research, a new ActSeg dataset is proposed, which consists of 641 videos with temporal annotations and per-frame ground truth segmentation masks.

The remainder of the paper is organized as follows. In Section 2, we review the related work. In Section 3, we present the problem formulation for spatio-temporal action localization with per-frame segmentation. In Section 4, the experimental results are presented with additional discussions. Finally, the paper is concluded in Section 5.

2. Related Works

The joint spatio-temporal action localization problem involves three distinctive tasks simultaneously, i.e., action classification, temporal action localization, and spatio-temporal action localization. Brief reviews of related works on these three topics are first provided. In addition, relevant works on video object segmentation are also introduced in this section.

2.1. Action Classification

The objective of action classification is to determine the presence of a specific action (e.g., jump and pole vault) in a video. A considerable amount of previous effort is limited to action classification in manually trimmed short videos [2,3,5,13,28,29], where each video clip contains one and only one action, without possible interference from either preceding/subsequent actions or complex background.

Many methods [1] rely on handcrafted local invariant features, such as histograms of oriented gradients (HOG) [30], histograms of flow (HOF) [31] and improved Dense Trajectories (iDT) [28]. Video representations are typically built on top of these features by the Fisher Vector (FV) [32] or the Vector of Locally Aggregated Descriptors (VLAD) [33] to determine action categories. Recently, CNN based methods [2,14,15] have enabled the replacement of handcrafted features with learned features, and they have achieved impressive classification performance. 3D ConvNets based methods [4,19–21] have also been proposed to construct spatio-temporal features. Tran et al. [4] demonstrated that 3D ConvNets are good feature learning machines that model appearance and motion simultaneously. Carreira et al. [19] proposed a new two-stream Inflated 3D ConvNet (I3D) architecture for action classification. Hara et al. [21] discovered that 3D architectures (two-stream I3D/ResNet/ResNeXt) pre-trained on the Kinetics dataset outperform complex 2D architectures. Subsequently, long short-term memory (LSTM)-based recurrent neural networks (RNNs) have been added on top of CNNs to incorporate longer-term temporal information and better classify video sequences [5,7].

2.2. Temporal Action Localization

Temporal action localization aims at pinpointing the starting and ending frames of a specific action in a video. Much progress has been made recently, thanks to plenty of large-scale datasets, including the THUMOS dataset [25], the ActivityNet dataset [34], and the MEXaction2 dataset [35]. Most state-of-the-art methods are based on sliding windows [1,11,32,36,37], frame-wise predictions [7,15,38,39], or action proposals [22,40,41].

The sliding window-based methods typically exploit fixed-length temporally sliding windows to sample each video. They can leverage the temporal dependencies among video frames, but they commonly lead to higher computational cost due to redundancies in overlapping windows. Gaidon et al. [36] used sliding window classifiers to locate action parts (actoms) from a sequence of histograms of actom-anchored visual features.
Oneata et al. [32] and Yuan et al. [37] both used sliding window classifiers on FV representations of iDT features. Shou et al. [11] proposed a sliding window-style 3D ConvNet for action localization without relying on hand-crafted features or FV/VLAD representations.

The frame-wise prediction-based methods classify each individual video frame (i.e., predict whether a specific category of action is present), and aggregate such predictions temporally. Singh et al. [38] used a frame-wise classifier for action location proposal, followed by a temporal aggregation step that promotes piecewise smoothness in such proposals. Yuan et al. [15] proposed to characterize the temporal evolution as a structural maximal sum of frame-wise classification scores. To account for the dynamics among video frames, RNNs with LSTM are typically employed. In [7,39], an LSTM produced detection scores of activities and non-activities based on CNN features at every frame. Although such RNNs can exploit temporal state transitions over frames for frame-wise predictions, their inputs are frame-level CNN features computed independently on each frame. On the contrary, in this paper we leverage 3D ConvNets with LSTM to capture the spatio-temporal information from adjacent frames.

The action proposal-based methods leverage temporal action proposals instead of video clips for efficient action localization. Jain et al. [22] produced tubelets (i.e., 2D+t sequences of bounding boxes) by merging a hierarchy of super-voxels. Yu and Yuan [40] proposed an actionness score and a greedy search strategy to generate action proposals. Buch et al. [41] introduced a temporal action proposal generation framework that only needs to process the entire video in a single pass.

2.3. Spatio-Temporal Action Localization

There are many publications on the spatio-temporal action localization problem [6,12,42–45]. Soomro et al. [43] proposed a method based on super-voxels. Several methods [6,44] formulated spatio-temporal action localization as a tracking problem, with object proposal detection at each video frame and sequences of bounding boxes as outputs. Kalogeiton et al. [12] proposed an action tubelet detector that takes a sequence of frames as input and produces sequences of bounding boxes with improved action scores as outputs. Singh et al. [45] presented an online learning framework for spatio-temporal action localization and prediction. Despite their successes, all the aforementioned spatio-temporal action localization methods require trimmed videos as inputs, and only output tubelet-style boundaries of an action, i.e., sequences of bounding boxes.

In contrast, we propose the spatio-temporal action localization detector Segment-tube for untrimmed videos, which can provide per-frame segmentation masks instead of sequences of bounding boxes. Moreover, to facilitate the training of the proposed Segment-tube detector and to establish a benchmark for future research, we introduce a new untrimmed video dataset for action localization and segmentation (i.e., the ActSeg dataset), with temporal annotations and per-frame ground truth segmentation masks.

2.4. Video Object Segmentation

Video object segmentation aims at separating the object of interest from the background throughout all video frames. Previous video object segmentation methods can be roughly categorized into unsupervised methods and their supervised counterparts.

Without requiring labels or annotations, unsupervised video object segmentation methods typically exploit features such as long-range point trajectories [46], motion characteristics [47], appearance [48,49], or saliency [50].
Recently, Jain et al. [51] proposed an end-to-end learning framework which combines motion and appearance information to produce a pixel-wise binary segmentation mask for each frame.

Differently, supervised video object segmentation methods do require user annotations of a primary object (i.e., the foreground), and the prevailing methods are based on label propagation [52,53]. For example, Marki et al. [52] utilize the segmentation mask of the first frame to construct appearance models, and the inference for subsequent frames is obtained by optimizing an energy function on a regularly sampled bilateral grid. Caelles et al. [54] adopted Fully Convolutional Networks (FCNs) to tackle video object segmentation, given the segmentation mask of the first frame.

However, all the above video object segmentation methods assume that the object of interest (or primary object) consistently appears throughout all video frames, which is reasonable for manually trimmed video datasets. On the contrary, for practical applications with user-generated, noisy untrimmed videos, this assumption seldom holds true. Fortunately, the proposed Segment-tube detector eliminates such a strong assumption; it is robust to irrelevant video frames and can be utilized to process untrimmed videos.

3. Problem Formulation

Given a video $V = \{f_t\}_{t=1}^{T}$ consisting of $T$ frames, our objective is to determine whether a specific action $k \in \{1,\ldots,K\}$ appears in $V$, and if so, to temporally pinpoint the starting frame $f_s(k)$ and ending frame $f_e(k)$ of action $k$. Simultaneously, a sequence of segmentation masks $B = \{b_t\}_{t=f_s(k)}^{f_e(k)}$ within this frame range should be obtained, with $b_t$ being a binary segmentation label for frame $f_t$. Practically, $b_t$ consists of a series of superpixels, $b_t = \{b_{t,i}\}_{i=1}^{N_t}$, with $N_t$ being the total number of superpixels in frame $f_t$.

3.1. Temporal Action Localization

A coarse-to-fine action localization strategy is implemented to accurately find the temporal boundaries of the target action $k$ in an untrimmed video, as illustrated in Figure 2. This is achieved by cascaded 3D ConvNets with LSTM. The 3D ConvNets [4] consist of eight 3D convolution layers, five 3D pooling layers, and two fully connected layers. The fully-connected 7th layer activation feature is used to represent the video clip. To exploit the temporal correlations, we incorporate a two-layer LSTM [5] using the peephole implementation (with 256 hidden states in each layer) with the 3D ConvNets.

Coarse Action Localization. The coarse action localization determines the approximate temporal boundaries with a fixed step-size (i.e., video clip length). We first generate a set of $U$ saliency-aware video clips $\{u_j\}_{j=1}^{U}$ with variable-length (e.g., 16 and 32 frames per video clip) sliding windows with a 75% overlap ratio on the initial segmentation $B_o$ of video $V$ (obtained using saliency [23]), and proceed to train cascaded 3D ConvNets with LSTM that couple a proposal network and a classification network.

The proposal network is action class-agnostic, and it determines whether any action ($\forall k \in \{1,\ldots,K\}$) is present in video clip $u_j$. The classification network determines whether a specific action $k$ is present in video clip $u_j$. We follow [11] to construct the training data from these video clips. The training details of the proposal network and the classification network are presented in Section 4.2.

Specifically, we train the proposal network (3D ConvNets with LSTM) to score each video clip $u_j$ with a proposal score $p^{pro}_j = \left[p^{pro}_j(1), p^{pro}_j(0)\right]^{\mathrm{T}} \in \mathbb{R}^2$. Subsequently, a flag label $l^{fla}_j$ is obtained for each video clip $u_j$,

$$l^{fla}_j = \begin{cases} 1, & \text{if } p^{pro}_j(1) > p^{pro}_j(0), \\ 0, & \text{otherwise}, \end{cases} \quad (1)$$

where $l^{fla}_j = 1$ denotes that the video clip $u_j$ contains an action ($\forall k \in \{1,\ldots,K\}$), and $l^{fla}_j = 0$ otherwise.
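To make the clip generation and the decision rule of Equation (1) concrete, the following is a minimal Python sketch. The window lengths (16 and 32 frames) and the 75% overlap ratio come from the text above; the function names and the `proposal_net` callable are illustrative stand-ins rather than the authors' implementation, and the proposal network itself is abstracted as a function returning the pair $(p^{pro}_j(1), p^{pro}_j(0))$ per clip.

```python
# Sketch of the coarse-localization bookkeeping: enumerate variable-length
# sliding-window clips and keep only those the proposal network flags as action.
from typing import Callable, List, Sequence, Tuple

def sliding_window_clips(num_frames: int,
                         clip_lengths: Sequence[int] = (16, 32),
                         overlap: float = 0.75) -> List[Tuple[int, int]]:
    """Enumerate (start, end) frame indices of clips with the given overlap ratio."""
    clips = []
    for length in clip_lengths:
        step = max(1, int(round(length * (1.0 - overlap))))  # e.g., 4 or 8 frames
        for start in range(0, max(1, num_frames - length + 1), step):
            clips.append((start, start + length))
    return clips

def flag_labels(clips: List[Tuple[int, int]],
                proposal_net: Callable[[Tuple[int, int]], Tuple[float, float]]) -> List[int]:
    """Equation (1): flag a clip as 1 iff p_pro(action) > p_pro(background)."""
    labels = []
    for clip in clips:
        p_action, p_background = proposal_net(clip)
        labels.append(1 if p_action > p_background else 0)
    return labels
```

Clips flagged with 1 form the candidate set that is passed on to the classification network described next.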
Figure 2. Overview of the proposed coarse-to-fine temporal action localization. (a) Coarse localization. Given an untrimmed video, we first generate saliency-aware video clips via variable-length sliding windows. The proposal network decides whether a video clip contains any actions (so the clip is added to the candidate set) or pure background (so the clip is directly discarded). The subsequent classification network predicts the specific action class for each candidate clip and outputs the classification scores and action labels. (b) Fine localization. With the classification scores and action labels from the prior coarse localization, further prediction of the video category is carried out and its starting and ending frames are obtained.

A classification network (also 3D ConvNets with LSTM) is further trained to predict a $(K+1)$-dimensional classification score $p^{cla}_j$ for each clip that contains an action, $\{u_j \mid l^{fla}_j = 1\}$, based on which a specific action label $l^{spe}_j \in \{k\}_{k=0}^{K}$ and a score $v^{spe}_j \in [0,1]$ for $u_j$ are assigned,

$$l^{spe}_j = \arg\max_{k=0,\ldots,K} p^{cla}_j(k), \quad (2)$$

$$v^{spe}_j = \max_{k=0,\ldots,K} p^{cla}_j(k), \quad (3)$$

where category 0 denotes the additional "background" category. Although the proposal network prefilters most "background" clips, a background category is still needed for robustness in the classification network.

Fine Action Localization. With the obtained per-clip specific action labels $l^{spe}_j$ and scores $v^{spe}_j$, the fine action localization step predicts the video category $k^*$ ($k^* \in \{1,\ldots,K\}$), and subsequently obtains its starting frame $f_s(k^*)$ and its ending frame $f_e(k^*)$. We calculate the average of the specific action scores $v^{spe}_j$ over all video clips for each specific action label $l^{spe}_j$, and take the label $k^*$ with the maximum average predicted score as the predicted action, as illustrated in Figure 3.

Subsequently, we average the specific action scores $v^{spe}_j$ of each frame $f_t$ for the label $k^*$ over the different video clips to obtain the action score $\alpha_t(f_t)$ for frame $f_t$. By selecting an appropriate threshold we can obtain the action label $l_t$ for frame $f_t$. The action score $\alpha_t(f_t \mid k^*)$ and the action label $l_t$ for frame $f_t$ are determined by

$$\alpha_t(f_t \mid k^*) = \frac{\sum_{j \in \{j \mid f_t \in u_j\}} v^{spe}_j}{\left|\{j \mid f_t \in u_j\}\right|}, \quad (4)$$

$$l_t = \begin{cases} k^*, & \text{if } \alpha_t > \gamma, \\ 0, & \text{otherwise}, \end{cases} \quad (5)$$

where $|\{\cdot\}|$ denotes the cardinality of the set $\{\cdot\}$. We empirically set $\gamma = 0.6$. $f_s(l_t)$ and $f_e(l_t)$ are assigned as the starting and ending frames of a series of consecutive frames sharing the same label $l_t$, respectively.

Figure 3. The diagrammatic sketch of the determination of the video category $k^*$ from video clips.

3.2. Spatial Action Segmentation

With the obtained temporal localization results, we further conduct spatial action segmentation. This problem is cast into a spatio-temporal energy minimization framework,

$$E(B) = \sum_{s_{t,i} \in V} D_i(b_{t,i}) + \sum_{s_{t,i}, s_{t,n} \in \mathcal{N}_i} S^{intra}_{in}(b_{t,i}, b_{t,n}) + \sum_{s_{t,i}, s_{m,n} \in \bar{\mathcal{N}}_i} S^{inter}_{in}(b_{t,i}, b_{m,n}), \quad (6)$$

where $s_{t,i}$ is the $i$th superpixel in frame $f_t$. $D_i(b_{t,i})$ composes the data term, denoting the cost of labeling $s_{t,i}$ with the label $b_{t,i}$ under a color and location based appearance model.
$S^{intra}_{in}(b_{t,i}, b_{t,n})$ and $S^{inter}_{in}(b_{t,i}, b_{m,n})$ compose the smoothness term, constraining the segmentation labels to be spatially coherent under a color based intra-frame consistency model, and temporally consistent under a color based inter-frame consistency model, respectively. $\mathcal{N}_i$ is the spatial neighborhood of $s_{t,i}$ in frame $f_t$. $\bar{\mathcal{N}}_i$ is the temporal neighborhood of $s_{t,i}$ in the adjacent frames $f_{t-1}$ and $f_{t+1}$. We compute the superpixels using SLIC [55], due to its superiority in terms of adherence to boundaries, as well as computational and memory efficiency. However, the proposed method is not tied to any specific superpixel method, and one can choose others.

Data Term. The data term $D_i(b_{t,i})$ defines the cost of assigning superpixel $s_{t,i}$ the label $b_{t,i}$ under an appearance model, which learns the color and location distributions of the action object and the background of video $V$. Given a segmentation $B$ of $V$, we estimate two color Gaussian Mixture Models (GMMs) and two location GMMs for the foreground and the background of $V$, respectively. The corresponding data term $D_i(b_{t,i})$ based on the color and location GMMs in Equation (6) is defined as

$$D_i(b_{t,i}) = -\log\!\left(\beta\, h^{col}_{b_{t,i}}(s_{t,i}) + (1-\beta)\, h^{loc}_{b_{t,i}}(s_{t,i})\right), \quad (7)$$

where $h^{col}_{b_{t,i}}$ denotes the two color GMMs, i.e., $h^{col}_1$ for the action object and $h^{col}_0$ for the background across video $V$. Similarly, $h^{loc}_{b_{t,i}}$ denotes the two location GMMs for the action object and the background across $V$, i.e., $h^{loc}_1$ and $h^{loc}_0$. $\beta$ is a parameter controlling the contributions of the color model $h^{col}_{b_{t,i}}$ and the location model $h^{loc}_{b_{t,i}}$.

Smoothness Term. The action segmentation labeling $B$ should be spatially consistent within each frame, and meanwhile temporally consistent throughout video $V$. Thus, we define the smoothness term by assembling an intra-frame consistency model and an inter-frame consistency model.

The intra-frame consistency model encourages spatially adjacent superpixels in the same action frame to have the same label. Since adjacent superpixels either have similar color or distinct color contrast [56], the well-known standard contrast-dependent function [56,57] is exploited to encourage spatially adjacent superpixels with similar color to be assigned the same label. Then, $S^{intra}_{in}(b_{t,i}, b_{t,n})$ in Equation (6) is defined as

$$S^{intra}_{in}(b_{t,i}, b_{t,n}) = \mathbb{1}_{[b_{t,i} \neq b_{t,n}]} \exp\!\left(-\|c_{t,i} - c_{t,n}\|^2_2\right), \quad (8)$$

where the characteristic function $\mathbb{1}_{[b_{t,i} \neq b_{t,n}]} = 1$ when $b_{t,i} \neq b_{t,n}$, and 0 otherwise. $b_{t,i}$ and $b_{t,n}$ are the segmentation labels of superpixels $s_{t,i}$ and $s_{t,n}$, respectively, and $c$ is the color vector of a superpixel.

The inter-frame consistency model encourages temporally adjacent superpixels in consecutive action frames to have the same label. As temporally adjacent superpixels should have similar color and motion, we use the Euclidean distance between the motion distributions of temporally adjacent superpixels, along with the contrast-dependent function of Equation (8), to constrain their labels to be consistent. In Equation (6), $S^{inter}_{in}(b_{t,i}, b_{m,n})$ is then defined as

$$S^{inter}_{in}(b_{t,i}, b_{m,n}) = \mathbb{1}_{[b_{t,i} \neq b_{m,n}]} \left(\exp\!\left(-\|c_{t,i} - c_{m,n}\|^2_2\right) + \exp\!\left(-\|hm_{t,i} - hm_{m,n}\|_2\right)\right), \quad (9)$$

where $hm$ is the histogram of oriented optical flow (HOOF) [58] of a superpixel.

Optimization. With $D_i(b_{t,i})$, $S^{intra}_{in}(b_{t,i}, b_{t,n})$ and $S^{inter}_{in}(b_{t,i}, b_{m,n})$, we leverage graph cut [24] to minimize the energy function in Equation (6), and obtain a new segmentation $B$ for video $V$.
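To illustrate how the terms of Equations (6)–(8) are assembled, the following is a minimal Python sketch. It assumes per-superpixel mean-color vectors and normalized location centroids are already available (e.g., from SLIC), uses scikit-learn's GaussianMixture as a stand-in for the GMM appearance models of Equation (7), and leaves $\beta$ and the actual graph-cut minimization of Equation (6) (via [24]) unspecified; all names are illustrative, not the authors' code.

```python
# Sketch of the data term (Eq. 7) and intra-frame smoothness weight (Eq. 8).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_appearance_models(colors, locations, labels, n_components=5):
    """Fit foreground/background color and location GMMs from a current segmentation B.
    colors: (N, 3) superpixel colors; locations: (N, 2) centroids; labels: (N,) in {0, 1}."""
    models = {}
    for lab in (0, 1):  # 0: background, 1: action object
        models[lab] = {
            "col": GaussianMixture(n_components=n_components).fit(colors[labels == lab]),
            "loc": GaussianMixture(n_components=n_components).fit(locations[labels == lab]),
        }
    return models

def data_term(colors, locations, models, beta=0.5):
    """Equation (7): D_i(b) = -log(beta * h_col_b + (1 - beta) * h_loc_b), b in {0, 1}."""
    costs = np.zeros((colors.shape[0], 2))
    for lab in (0, 1):
        p_col = np.exp(models[lab]["col"].score_samples(colors))
        p_loc = np.exp(models[lab]["loc"].score_samples(locations))
        costs[:, lab] = -np.log(beta * p_col + (1.0 - beta) * p_loc + 1e-12)
    return costs  # unary costs, one row per superpixel

def intra_frame_weight(color_i, color_n):
    """Equation (8): contrast-dependent penalty, paid only when the two labels differ."""
    return float(np.exp(-np.sum((color_i - color_n) ** 2)))
```

In the full method these unary costs, together with the intra- and inter-frame pairwise weights, would be handed to a max-flow/min-cut solver as in [24]; the sketch only shows how the individual terms are computed.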
3.3. Iterative and Alternating Optimization

With an initial spatial segmentation $B_o$ of video $V$ obtained using saliency [23], the temporal action localization first pinpoints the starting frame $f_s(k)$ and the ending frame $f_e(k)$ of a target action $k$ in an untrimmed video $V$ by the coarse-to-fine action localization strategy, and then the spatial action segmentation produces the per-frame spatial segmentation $B$ by focusing on the action frames identified by the temporal action localization. With the new segmentation $B$ of video $V$, the overall optimization alternates between the temporal action localization and the spatial action segmentation. Upon practical convergence of this iterative process, the final results $B$ are obtained. Naturally, the temporal action localization and the spatial action segmentation benefit each other. In the experiments, we terminate the iterative optimization once practical convergence is observed, i.e., the relative variation between two successive spatio-temporal action localization results is smaller than 0.001.

4. Experiments and Discussion

4.1. Datasets and Evaluation Protocol

We conduct extensive experiments on multiple datasets to evaluate the efficacy of the proposed spatio-temporal action localization detector Segment-tube, including (1) the temporal action localization task on the THUMOS 2014 dataset [25]; (2) spatial action segmentation on the SegTrack dataset [26,27]; and (3) the spatio-temporal action localization task on the newly proposed ActSeg dataset.

The average precision (AP) and mean average precision (mAP) are employed to evaluate the temporal action localization performance. The temporal localization of an action is deemed correct if the action is assigned the same category label as the ground truth and, simultaneously, its predicted temporal range overlaps the ground truth at a ratio above a predefined threshold (e.g., 0.5).

The intersection-over-union (IoU) value is utilized to evaluate the spatial action segmentation performance, and it is defined as

$$\text{IoU} = \frac{|Seg \cap GT|}{|Seg \cup GT|}, \quad (10)$$

where $Seg$ denotes the binary segmentation result obtained by a detector, $GT$ denotes the binary ground truth segmentation mask, and $|\cdot|$ denotes the cardinality (i.e., pixel count).

4.2. Implementation Details

Training the proposal network. The proposal network predicts whether each video clip $u_j$ contains an action ($l^{fla}_j = 1$) or the background ($l^{fla}_j = 0$), and thus can remove the background video clips, as described in Section 3.1. We build the training data as follows. For each video clip from trimmed videos, we assign its action label as 1, denoting that it contains some action $k$ ($\forall k \in \{1,\ldots,K\}$). For each video clip from untrimmed videos with temporal annotations, we set its label using the IoU value between the clip and the ground truth action instances. If the IoU value is higher than 0.75, we assign the label 1, denoting that the clip contains an action; if the IoU value is lower than 0.25, we assign the label 0, denoting that it does not contain an action.

The 3D ConvNets [4] components (as shown in Figure 2) are pre-trained on the training split of the Sports-1M dataset [59], and used as the initializations of our proposal and classification networks. The output of the softmax layer in the proposal network is of two dimensions, which correspond to either an action or the background. In all the following experiments, the batch size is fixed at 40 during the training phase, and the initial learning rate is set to $10^{-4}$ with a learning rate decay of factor 10 every 10K iterations.
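The label assignment used to build the proposal-network training data can be summarized by a short sketch. The 0.75/0.25 temporal-IoU thresholds come from the text above; treating clips with intermediate overlap as ignored is an assumption (the paper does not state how such clips are handled), and the helper names are illustrative.

```python
# Sketch of the proposal-network training-label assignment for untrimmed videos.
def temporal_iou(clip, instance):
    """IoU of two (start_frame, end_frame) intervals, the temporal analogue of Equation (10)."""
    inter = max(0, min(clip[1], instance[1]) - max(clip[0], instance[0]))
    union = (clip[1] - clip[0]) + (instance[1] - instance[0]) - inter
    return inter / union if union > 0 else 0.0

def proposal_label(clip, gt_instances, pos_thr=0.75, neg_thr=0.25):
    """Return 1 (action), 0 (background), or None (ignored) for a training clip."""
    best_iou = max((temporal_iou(clip, gt) for gt in gt_instances), default=0.0)
    if best_iou > pos_thr:
        return 1
    if best_iou < neg_thr:
        return 0
    return None  # ambiguous overlap; assumed to be excluded from training
```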
For the LSTM component, the activation feature of the fully-connected 7th layer of the 3D ConvNets [4] is used as the input to the LSTM. The learning batch size is set to 32, where each sample in the minibatch is a sequence of ten 16-frame video clips. We use RMSprop [60] with a learning rate of $10^{-4}$, a momentum of 0.9 and a weight decay factor of $5 \times 10^{-4}$. The number of iterations depends on the size of the dataset, and will be elaborated in the following temporal action localization experiments.

Training the classification network. The classification network further predicts whether each video clip $u_j$ contains a specific action ($l^{spe}_j \in \{k\}_{k=0}^{K}$) or not, as described in Section 3.1. The training data for the classification network are built similarly to those of the proposal network. The only difference is that, for a saliency-aware positive video clip, we assign its label as a specific action category $k \in \{1,\ldots,K\}$ (e.g., "LongJump"), instead of 1 as for training the proposal network above.

As to the 3D ConvNets [4] (see Figure 2), we train a classification model with $K$ actions plus one additional "background" category. The learning batch size is fixed at 40, the initial learning rate is $10^{-4}$, and the learning rate is divided by 2 after every 10K iterations.

To train the LSTM, the activation feature of the fully-connected 7th layer of the 3D ConvNets [4] is fed to the LSTM. We fix the learning batch size at 32, where each sample in the minibatch is a sequence of ten 16-frame video clips. We also use RMSprop [60] with a learning rate of $10^{-4}$, a momentum of 0.9 and a weight decay factor of $5 \times 10^{-4}$.

4.3. Temporal Action Localization on the THUMOS 2014 Dataset

We first evaluate the temporal action localization performance of the proposed Segment-tube detector on the THUMOS 2014 dataset [25], which is dedicated to localizing actions in long untrimmed videos involving 20 actions. The training set contains 2755 trimmed videos and 1010 untrimmed validation videos. For the 3D ConvNets training, the fine-tuning stops at 30K iterations for the two networks. For the LSTM training, the number of training iterations is 20K for the two networks. For testing, we use 213 untrimmed videos that contain relevant action instances.

Five existing temporal action localization methods, i.e., AMA [1], FTAP [9], ASLM [10], SCNN [11], and ASMS [15], are included as competing algorithms. AMA [1] combines iDT features and frame-level CNN features to train an SVM classifier. FTAP [9] leverages high-recall temporal action proposals. ASLM [10] uses a length and language model based on traditional motion features. SCNN [11] is an end-to-end segment-based 3D ConvNets framework, including proposal, classification and localization networks. ASMS [15] localizes actions by searching for the structured maximal sum.

The mAP comparisons are summarized in Table 1, which demonstrate that the proposed Segment-tube detector evidently outperforms the five competing algorithms with IoU thresholds of 0.3 and 0.5, and is marginally inferior to SCNN [11] with an IoU threshold of 0.4. We also present the qualitative temporal action localization results of the proposed Segment-tube detector for two action instances of the testing split from the THUMOS 2014 dataset in Figure 4, with the IoU threshold being 0.5.

(Figure 4 panel data: (a) CliffDiving, ground truth 108.8–112.8 s, localized output 109.4–113.6 s; (b) LongJump, ground truth 53.8–70.9 s, localized output 53.1–71.3 s.)
Figure 4. Qualitative temporal action localization results of the proposed Segment-tube detector for two action instances, i.e., (a) CliffDiving and (b) LongJump, in the testing split of the THUMOS 2014 dataset, with the intersection-over-union (IoU) threshold being 0.5.
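For concreteness, the LSTM training configuration listed in Section 4.2 can be written down as the following sketch. The paper does not name a deep learning framework, so PyTorch is assumed here purely for illustration; the 4096-dimensional input is assumed to be the dimensionality of the C3D fc7 feature, `lstm_head` is a hypothetical stand-in for the two-layer, 256-unit LSTM on top of the 3D ConvNets, and torch.nn.LSTM does not provide the peephole variant mentioned in Section 3.1, so this is only an approximation of the described setup.

```python
# Sketch of the LSTM optimizer and minibatch layout from Section 4.2.
import torch

lstm_head = torch.nn.LSTM(input_size=4096, hidden_size=256, num_layers=2, batch_first=True)

optimizer = torch.optim.RMSprop(
    lstm_head.parameters(),
    lr=1e-4,            # initial learning rate from Section 4.2
    momentum=0.9,       # momentum from Section 4.2
    weight_decay=5e-4,  # weight decay factor from Section 4.2
)

# One minibatch: 32 samples, each a sequence of ten 16-frame clips, where each
# clip is represented by the fc7 activation of the 3D ConvNets (assumed 4096-d).
features = torch.randn(32, 10, 4096)
outputs, _ = lstm_head(features)
```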
