Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

Zheng Shou, Dongang Wang, and Shih-Fu Chang
Columbia University, New York, NY, USA
{zs2262,dw2648,sc250}@columbia.edu

Abstract

We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and achieve high temporal localization accuracy. In the end, only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly superior performance compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014.

1. Introduction

Impressive progress has been reported in recent literature for action recognition [42, 28, 2, 3, 39, 40, 24, 18, 31, 44, 13, 37]. Besides detecting actions in manually trimmed short videos, researchers have started to develop techniques for detecting actions in untrimmed long videos in the wild. This trend motivates another challenging topic - temporal action localization: given a long untrimmed video, "when does a specific action start and end?" This problem is important because real applications usually involve long untrimmed videos, which can be highly unconstrained in space and time, and one video can contain multiple action instances plus background scenes or other activities. Localizing actions in long videos, such as those in surveillance, can save tremendous time and computational costs.

Most state-of-the-art methods rely on manually selected features, and their performance still requires much improvement. For example, the top performing approaches in THUMOS Challenge 2014 [27, 41, 17, 15] and 2015 [46, 9] both used improved Dense Trajectories (iDT) with Fisher Vector (FV) encoding [40, 25]. There have been some recent attempts at incorporating iDT features with appearance features automatically extracted by frame-level deep networks [27, 41, 17]. Nevertheless, such 2D ConvNets do not capture motion information, which is important for modeling actions and determining their temporal boundaries.

As an analogy in still images, object detection recently achieved large improvements by using deep networks. Inspired by Region-based Convolutional Neural Networks (R-CNN) [7] and its upgraded versions [6, 30, 21], we develop Segment-CNN¹, an effective deep network framework for temporal action localization as outlined in Figure 1. We adopt 3D ConvNets [13, 37], which have recently been shown to be promising for capturing motion characteristics in videos, and add a new multi-stage framework. First, multi-scale segments are generated as candidates for three deep networks. The proposal network classifies each segment as either action or background in order to eliminate background segments that are unlikely to contain actions of interest. The classification network trains a typical one-vs-all classification model for all action categories plus the background.

¹Source code and trained models are available online at https://github.com/zhengshou/scnn/.

However, the classification network aims at finding key evidence to distinguish different categories, rather than localizing precise action presence in time. Sometimes, the scores from the classification network can be high even when the segment has only a very small overlap with the ground truth instance.
This can be detrimental because subsequent post-processing steps, such as Non-Maximum Suppression (NMS), might remove a segment of small score but large overlap with the ground truth. To explicitly take temporal overlap into consideration, we introduce the localization network, which is based on the same architecture but uses a novel loss function that rewards segments with higher temporal overlap with the ground truths, and thus can generate confidence scores more suitable for post-processing. Note that the classification network cannot be replaced by the localization network. We will show later that using the trained classification network (which does not consider temporal overlap) to initialize the localization network (which takes temporal overlap into account) is important, and achieves better temporal localization accuracy.

Figure 1. Overview of our framework. (a) Multi-scale segment generation: given an untrimmed video, we generate segments of varied lengths via sliding windows; (b) Segment-CNN: the proposal network identifies candidate segments, the classification network trains an action recognition model to serve as initialization for the localization network, and the localization network localizes action instances in time and outputs confidence scores; (c) Post-processing: using the prediction scores from the localization network, we further remove redundancy by NMS to obtain the final results. During training, the classification network is first learned and then used as initialization for the localization network. During prediction, only the proposal and localization networks are used.

To summarize, our main contributions are three-fold:

(1) To the best of our knowledge, our work is the first to exploit 3D ConvNets with multi-stage processes for temporal action localization in untrimmed long videos in the wild.
(2) We introduce an effective multi-stage Segment-CNN framework to propose candidate segments, recognize actions, and localize temporal boundaries. The proposal network improves efficiency by eliminating unlikely candidate segments, and the localization network is key to boosting temporal localization accuracy.

(3) The proposed techniques significantly outperform the state-of-the-art systems on two large-scale benchmarks suitable for temporal action localization. When the overlap threshold used in evaluation is set to 0.5, our approach improves mAP on MEXaction2 from 1.7% to 7.4% and mAP on THUMOS 2014 from 15.0% to 19.0%. We did not evaluate on THUMOS Challenge 2015 [9] because its ground truth is withheld by the organizers for future evaluation. More detailed evaluation results are available in Section 4.

2. Related work

Temporal action localization. This topic has been studied in two directions. When training data only have video-level category labels but no temporal annotations, researchers formulated this as a weakly supervised or multiple instance learning problem, learning the key evidence in untrimmed videos and temporally localizing actions by selecting key instances [22, 23]. Sun et al. [36] transferred knowledge from web images to address temporal localization in untrimmed web videos.

Another line of work focuses on learning from data in which the temporal boundaries of action instances in untrimmed videos have been annotated, such as THUMOS. Most of these works pose this as a classification problem and adopt a temporal sliding window approach, where each window is considered as an action candidate subject to classification [25]. Surveys of action classification methods can be found in [42, 28, 2, 3]. Recently, two directions lead the state of the art: (1) Wang et al. [39] proposed extracting HOG, HOF, and MBH features along dense trajectories, and later took camera motion into consideration [40]. Further improvement can be achieved by stacking features with multiple time skips [24]. (2) Enlightened by the success of CNNs in recent works [20, 32], Karpathy et al. [18] evaluated frame-level CNNs on large-scale video classification tasks. Simonyan and Zisserman [31] designed two-stream CNNs to learn from still images and motion flow respectively. In [44], a latent concept descriptor of convolutional feature maps was proposed, and great results were achieved on event detection with VLAD encoding. To learn spatio-temporal features together, the architecture of 3D ConvNets was explored in [13, 37], achieving competitive results. Oneata et al. [26] proposed approximately normalized Fisher Vectors to reduce the high dimensionality of FV. Stoian et al. [35] introduced a two-level cascade to allow fast search for action instances. These methods focus on improving the efficiency rather than the precision of conventional approaches. To specifically address the temporal precision of action detection, Gaidon et al. [4, 5] modeled the structure of action sequences with atomic action units (actoms). The explicit modeling of action units allows for matching more complete action unit sequences, rather than just partial content.
However, this requires manual annotation of actoms, which can be subjective and burdensome. Our paper aims to solve the same problem of precise temporal localization, but without requiring the difficult task of manually annotating atomic action units.

Spatio-temporal localization. There have been active explorations of localizing actions in space and time simultaneously. Jain et al. [10] and Soomro et al. [33] built their work on supervoxels. Recently, researchers have treated this as a tracking problem [43, 8] by leveraging object detectors [11], especially human detectors [16, 19, 8, 45], to detect regions of interest in each frame and then output sequences of bounding boxes. Dense trajectories have also been exploited for extracting action tubes [29, 38]. Jain et al. [12] added object encodings to help action localization.

However, this problem is different from temporal localization, which is the main topic of this paper: (1) When using object detectors to find spatio-temporal regions of interest, such approaches assume that the actions are performed by humans or other pre-defined objects. (2) Spatio-temporal localization requires exhaustive annotations for objects of interest on every frame as training data. This makes it overwhelmingly time-consuming, particularly for long untrimmed videos, compared with the task of simply labeling the start time and end time of an action depicted in the video, which is sufficient for many applications.

Object detection. Inspired by the success of deep learning approaches in object detection, we also review R-CNN and its variations. R-CNN consists of selective search, CNN feature extraction, SVM classification, and bounding box regression [7]. Fast R-CNN reshapes R-CNN into a single stage using a multi-task loss, and also has a RoI pooling layer to share the computation of one image in ConvNets [6]. Our work differs from R-CNN in the following aspects: (1) Temporal annotations in training videos can be diverse: some are cleanly trimmed action instances cut out from long videos, such as UCF101 [34], and some are untrimmed long videos with temporal boundaries annotated for action instances, such as THUMOS [15, 9]. We provide a paradigm that can handle such diverse annotations. (2) As proven in Faster R-CNN [30], which proposes a region proposal network, and DeepBox [21], which detects objectness to re-rank the results of R-CNN, using deep networks for learning objectness is effective and efficient. Therefore, we directly use a deep network to classify background versus action to obtain candidate segments. (3) We remove the regression stage because learning regression for the time shift and duration of a video segment does not work well in our experiments, probably because actions can be quite diverse and therefore do not contain consistent patterns for predicting start/end time. To achieve precise localization, we instead design the localization network with a new loss function that explicitly considers temporal overlap. This decreases the score of a segment that has small overlap with the ground truth and increases the score of a segment with larger overlap, which also benefits post-processing steps, such as NMS, in keeping segments of higher temporal localization accuracy.

3. Detailed descriptions of Segment-CNN

3.1. Problem setup

Problem definition. We denote a video as X = {x_t}_{t=1}^{T}, where x_t is the t-th frame in X and T is the total number of frames in X. Each video X is associated with a set of temporal action annotations Ψ = {(ψ_m, ψ'_m, k_m)}_{m=1}^{M}, where M is the total number of action instances in X, and k_m, ψ_m, ψ'_m are, respectively, the action category of instance m and its starting time and ending time (measured by frame ID). k_m ∈ {1, ..., K}, where K is the number of categories. During training, we have a set T of trimmed videos and a set U of untrimmed videos. Each trimmed video X ∈ T has ψ_m = 1, ψ'_m = T, and M = 1.

Multi-scale segment generation. First, each frame is resized to 171 (width) × 128 (height) pixels. For an untrimmed video X ∈ U, we conduct temporal sliding windows of varied lengths (16, 32, 64, 128, 256, 512 frames) with 75% overlap. For each window, we construct the segment s by uniformly sampling 16 frames. Consequently, for each untrimmed video X, we generate a set of candidates Φ = {(s_h, φ_h, φ'_h)}_{h=1}^{H} as input for the proposal network, where H is the total number of sliding windows for X, and φ_h and φ'_h are respectively the starting time and ending time of the h-th segment s_h. For a trimmed video X ∈ T, we directly sample a segment s of 16 frames from X uniformly.
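To make the segment generation concrete, below is a minimal Python sketch of the multi-scale sliding-window scheme described above. It is our illustration rather than the authors' released implementation; the function name, the clamping of windows that run past the end of the video, and the printed example are assumptions.

```python
def generate_candidates(num_frames,
                        window_lengths=(16, 32, 64, 128, 256, 512),
                        overlap=0.75):
    """Return a list of (start, end, sampled_frames) candidate segments.

    Windows of each length slide with 75% overlap, and 16 frames are
    uniformly sampled from every window, matching the description above.
    How windows near the end of the video are handled is our assumption
    (indices are clamped to the last frame).
    """
    candidates = []
    for length in window_lengths:
        stride = max(1, int(length * (1.0 - overlap)))
        for start in range(0, max(1, num_frames - length + 1), stride):
            end = min(start + length, num_frames)
            sampled = [min(num_frames - 1,
                           start + round(i * (length - 1) / 15.0))
                       for i in range(16)]
            candidates.append((start, end, sampled))
    return candidates


# Example: an untrimmed clip of 1800 frames (about one minute at 30 fps).
print(len(generate_candidates(1800)))
```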
3DConvNetsconducts3Dconvo- Note that, compared with the multi-class classification lution/pooling which operates in spatial and temporal di- network, thisproposalnetworkissimplerbecausetheout- mensions simultaneously, and therefore can capture both putlayeronlyconsistsoftwonodes(actionorbackground). appearance and motion for action. Given the competitive performances on video classification tasks, our deep net- The classification network: After substantial background works use 3D ConvNets as the basic architecture in all segmentsareremovedbytheproposalnetwork, wetraina stages and follow the network architecture of [37]. All classificationmodelΘ forK actioncategoriesaswellas cls 3D pooling layers use max pooling and have kernel size background. of2×2inspatialwithstride2,whilevaryintemporal. All PreparingthetrainingdataS followsasimilarstrategy cls 3D convolutional filters have kernel size 3 and stride 1 in fortheproposalnetwork. Exceptwhenassigninglabelfor all three dimensions. Using the notations conv(number positivesegment,theclassificationnetworkexplicitlyindi- of filters) for the 3D convolutional layer, pool(temporal catesactioncategoryk ∈{1,...,K}. Moreover,inorder m kernel size, temporal stride) for the 3D pooling layer, and tobalancethenumberoftrainingdataforeachclass,were- ofuct(noufmtbheerseoftfihrleteerst)yfpoersthoefflualylyercsoinnneocuterdalracyheitre,cthtuerelayis- ducethenumberofbackgroundinstancestoNb ≈ NTK+NU. As for parameters in SGD, the learning rate is 0.0001, as follows: conv1a(64) - pool1(1,1) - conv2a(128) - with the exception of 0.01 for fc8, momentum is 0.9, pool2(2,2) - conv3a(256) - conv3b(256) - pool3(2,2) - weight decay factor is 0.0005, and the learning rate is di- conv4a(512)-conv4b(512)-pool4(2,2)-conv5a(512) videdbyafactorof2forevery10Kiterations,becausethe - conv5b(512) - pool5(2,2) - fc6(4096) - fc7(4096) - convergenceshallbeslowerwhenthenumberofclassesin- fc8(K+1). Eachinputforthisdeepnetworkisasegment creases. s of dimension 171×128×16. C3D is training this net- workonSports-1Mtrainsplit[37],andweuseC3Dasthe The localization network: As illustrated in Figure 2, it is initializationforourproposalandclassificationnetworks. important to push up the prediction score of the segment 3.2.Trainingprocedure with larger overlap with the ground truth instance and de- crease the scores of the segment with smaller overlap, to Theproposalnetwork: WetrainaCNNnetworkΘpro as make sure that the subsequent post-processing steps can thebackgroundsegmentfilter.Basically,fc8hastwonodes choosesegmentswithhigheroverlapoverthosewithsmall thatcorrespondinglyrepresentthebackground(rarelycon- overlap. Therefore, we propose this localization network tains action of interest) and being-action (has significant Θ withanewlossfunction,whichtakesIoUwithground loc portionbelongstotheactionsofinterest). truthinstanceintoconsideration. We use the followingstrategy to construct training data Training data S for the localization network are aug- S = {(s ,k )}N ,wherelabelk ∈ {0,1}. Foreach loc pro n n n=1 n mented from Scls by associating each segment s with the segment of the trimmed video X ∈ T, we set its label measurementofoverlap,v. Inspecific,wesetv = 1fors as positive. For candidate segments from an untrimmed fromtrimmedvideo. 
The localization network: As illustrated in Figure 2, it is important to push up the prediction score of a segment that has larger overlap with the ground truth instance and to decrease the scores of segments with smaller overlap, so that the subsequent post-processing steps choose segments of higher overlap over those of small overlap. Therefore, we propose the localization network Θ_loc with a new loss function, which takes the IoU with the ground truth instance into consideration.

Figure 2. Typical case of bad localization. Assume that the system outputs three predictions, A, B, and C (all CliffDiving, with scores 0.95, 0.90, and 0.85). Probably because there is some evidence during [t_1, t_2], A has the highest prediction score. Therefore, NMS will keep A, remove B, and then keep C. However, we actually hope to remove A and C in NMS and keep B, because B has the largest IoU with the ground truth instance.

Training data S_loc for the localization network are augmented from S_cls by associating each segment s with a measurement of overlap, v. Specifically, we set v = 1 for s from a trimmed video. If s comes from an untrimmed video and has a positive label k, we set v equal to the overlap (measured by IoU) of segment s with the associated ground truth instance. If s is a background segment, its overlap measurement will not affect our loss function and gradient computation in back-propagation, as we show below, and thus we simply set its v to 1.

During each mini-batch, there are N training samples {(s_n, k_n, v_n)}_{n=1}^{N}. For the n-th segment, the output vector of fc8 is O_n, and the prediction score vector after the softmax layer is P_n; for the i-th class, $P_n^{(i)} = \frac{e^{O_n^{(i)}}}{\sum_{j=1}^{K+1} e^{O_n^{(j)}}}$. The new loss function is formed by combining L_softmax and L_overlap:

$L = L_{\mathrm{softmax}} + \lambda \cdot L_{\mathrm{overlap}}$,  (1)

where λ balances the contribution of each part; through empirical validation, we find that λ = 1 works well in practice. L_softmax is the conventional softmax loss, defined as

$L_{\mathrm{softmax}} = \frac{1}{N} \sum_n \left( -\log P_n^{(k_n)} \right)$,  (2)

which is effective for training deep networks for classification. L_overlap is designed to jointly reduce the classification error and adjust the intensity of the confidence score according to the extent of overlap:

$L_{\mathrm{overlap}} = \frac{1}{N} \sum_n \frac{1}{2} \left( \frac{\big(P_n^{(k_n)}\big)^2}{(v_n)^{\alpha}} - 1 \right) \cdot [k_n > 0]$.  (3)

Here, [k_n > 0] equals 1 when the true class label k_n is positive, and equals 0 when k_n = 0, which means that s_n is a background training sample. L_overlap is intended to boost the detection scores (P) of segments that have high overlaps (v) with ground truth instances, and to reduce the scores of those with small overlaps. The hyper-parameter α controls the adjustment range for the intensity of the confidence score; the sensitivity of α is explored in Section 4.
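For concreteness, a small NumPy sketch of the combined loss in Eqs. (1)-(3) is given below. This is our re-implementation for illustration, not the authors' Caffe layer; the variable names and batch layout are assumptions.

```python
import numpy as np

def localization_loss(scores, labels, overlaps, alpha=0.25, lam=1.0):
    """Combined loss of Eqs. (1)-(3) for one mini-batch.

    scores:   (N, K+1) raw fc8 outputs O_n
    labels:   (N,) true class indices k_n, with 0 denoting background
    overlaps: (N,) overlap measurements v_n (set to 1 for background and
              trimmed-video segments, as in the text)
    """
    n = scores.shape[0]
    # Softmax probabilities P_n over the K+1 output nodes.
    exps = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exps / exps.sum(axis=1, keepdims=True)
    p_true = probs[np.arange(n), labels]

    l_softmax = -np.log(p_true).mean()                        # Eq. (2)
    positive = (labels > 0).astype(float)                     # [k_n > 0]
    l_overlap = (0.5 * (p_true ** 2 / overlaps ** alpha - 1.0)
                 * positive).mean()                           # Eq. (3)
    return l_softmax + lam * l_overlap                        # Eq. (1)

# With lam = 1, the per-sample loss of a positive segment is minimized at
# p_true = sqrt(v ** alpha), which is what pushes down confident scores on
# poorly overlapping windows.
```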
In addition, the total gradient with respect to the output of the i-th node in fc8 is

$\frac{\partial L}{\partial O_n^{(i)}} = \frac{\partial L_{\mathrm{softmax}}}{\partial O_n^{(i)}} + \lambda \cdot \frac{\partial L_{\mathrm{overlap}}}{\partial O_n^{(i)}}$,  (4)

in which

$\frac{\partial L_{\mathrm{softmax}}}{\partial O_n^{(i)}} = \begin{cases} \frac{1}{N} \big( P_n^{(k_n)} - 1 \big) & \text{if } i = k_n \\ \frac{1}{N} \, P_n^{(i)} & \text{if } i \neq k_n \end{cases}$  (5)

and

$\frac{\partial L_{\mathrm{overlap}}}{\partial O_n^{(i)}} = \begin{cases} \frac{1}{N} \cdot \frac{\big(P_n^{(k_n)}\big)^2}{(v_n)^{\alpha}} \cdot \big( 1 - P_n^{(k_n)} \big) \cdot [k_n > 0] & \text{if } i = k_n \\ \frac{1}{N} \cdot \frac{\big(P_n^{(k_n)}\big)^2}{(v_n)^{\alpha}} \cdot \big( -P_n^{(i)} \big) \cdot [k_n > 0] & \text{if } i \neq k_n \end{cases}$.  (6)

Figure 3. An illustration of how L_overlap works compared with L_softmax for each positive segment (left panel: L_softmax alone; right panel: L_softmax + λ·L_overlap). Here we use α = 1, λ = 1, and vary the overlap v in L_overlap (v = 0.5, 0.75, 1). The x-axis is the prediction score at the node that corresponds to the true label, and the y-axis is the loss.

Given a training sample (s_n, k_n, v_n), Figure 3 shows how L_overlap influences the original softmax loss. It also provides more concrete insights into the design of this loss function. (1) If the segment belongs to the background, L_overlap = 0 and L = L_softmax. (2) If the segment is positive, L reaches its minimum at P_n^(k_n) = √((v_n)^α), and therefore penalizes two cases: either P_n^(k_n) is too small due to misclassification, or P_n^(k_n) explodes and exceeds the learning target √((v_n)^α), which grows with the overlap v_n. Also note that L is designed to increase as v_n decreases, considering that a training segment with smaller overlap with the ground truth instance is less reliable because it may include considerable noise. (3) In particular, if a positive segment has overlap v_n = 1, the loss function becomes similar to the softmax loss, and L decreases monotonically as P_n^(k_n) goes from 0 to 1.

During optimization, Θ_loc is fine-tuned on Θ_cls, because classification is also one objective of the localization network and a trained classification network is a good initialization. We use the same learning rate, momentum, and weight decay factor as for the classification network. Other parameters that depend on the dataset are indicated in Section 4.

3.3. Prediction and post-processing

During prediction, we slide temporal windows of varied lengths to generate a set of segments and input them into Θ_pro to obtain proposal scores P_pro. In this paper, we keep segments with P_pro ≥ 0.7. Then we evaluate the retained segments with Θ_loc to obtain action category predictions and confidence scores P_loc. During post-processing, we remove segments predicted as background and refine P_loc of each remaining detection by multiplying it with the class-specific frequency of its window length in the training data, so that detections are weighted according to the window length distribution observed in training. Finally, because redundant detections are not allowed in evaluation, we conduct NMS based on P_loc to remove redundant detections, and set the overlap threshold in NMS slightly smaller than the overlap threshold θ used in evaluation (θ − 0.1 in this paper).
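The NMS step above is a standard greedy 1-D non-maximum suppression. A minimal sketch (ours, under the conventions of this section; not taken from the released code) is:

```python
def temporal_nms(detections, iou_threshold):
    """Greedy 1-D non-maximum suppression for one action class.

    detections: list of (start, end, score) tuples.
    With the settings above, iou_threshold would be theta - 0.1.
    """
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(iou(det, k) < iou_threshold for k in kept):
            kept.append(det)
    return kept


# The Figure 2 scenario: the highest-scoring window A suppresses the
# better-localized window B, which is exactly what the overlap loss counteracts.
print(temporal_nms([(10, 20, 0.95), (12, 26, 0.90), (24, 30, 0.85)], 0.4))
```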
4. Experiments

4.1. Datasets and setup

MEXaction2 [1]. This dataset contains two action classes: "BullChargeCape" and "HorseRiding". It consists of three subsets: INA videos, YouTube clips, and UCF101 Horse Riding clips. The YouTube clips and UCF101 Horse Riding clips are trimmed, whereas the INA videos are untrimmed and are approximately 77 hours in total. With regard to action instances with temporal annotation, they are divided into a train set (1336 instances), a validation set (310 instances), and a test set (329 instances).

THUMOS 2014 [15]. The temporal action detection task in THUMOS Challenge 2014 is dedicated to localizing action instances in long untrimmed videos. The detection task involves 20 categories, as indicated in Figure 4. The trimmed videos used for training are 2755 videos of these 20 actions in UCF101. The validation set contains 1010 untrimmed videos with temporal annotations of 3007 instances in total. The test set contains 3358 action instances from 1574 untrimmed videos, of which only 213 contain action instances of interest. We exclude the remaining 1361 background videos in the test set.

4.2. Comparison with state-of-the-art systems

Evaluation metrics. We follow the conventional metrics used in the THUMOS Challenge and regard temporal action localization as a retrieval problem, evaluating average precision (AP). A prediction is marked as correct only when it has the correct category prediction and its IoU with a ground truth instance is larger than the overlap threshold. Note that redundant detections are not allowed.

Results on MEXaction2. We build our system based on Caffe [14] and C3D [37]. We use the train set in MEXaction2 for training. The number of training iterations is 30K for the proposal network, 20K for the classification network, and 20K for the localization network with α = 0.25. We denote our Segment-CNN using the above settings as S-CNN and compare it with typical dense trajectory features (DTF) with a bag-of-visual-words representation. The DTF results are provided by [1]², which trains three SVM models with different sets of negative samples and averages AP over all of them. According to Table 1, our Segment-CNN achieves a tremendous performance gain for the "BullChargeCape" action and competitive performance for the "HorseRiding" action. Figure 5 displays our prediction results for "BullChargeCape" and "HorseRiding", respectively.

²Note that the results reported in [1] use different evaluation metrics. To make them comparable, we re-evaluate their prediction results according to the standard criteria mentioned in Section 4.2.

Table 1. Average precision (%) on MEXaction2. The overlap threshold is set to 0.5 during evaluation.
    AP (%)   BullChargeCape   HorseRiding   mAP
    DTF      0.3              3.1           1.7
    S-CNN    11.6             3.1           7.4

Results on THUMOS 2014³. The instances in the train set and validation set are used for training. The number of training iterations is 30K for all three networks. We again set α = 0.25 for the localization network. We denote our Segment-CNN using the above settings as S-CNN.

³Note that the evaluation toolkit used in THUMOS 2014 has some bugs, and recently the organizers released a new toolkit with fair evaluation criteria. Here, we re-evaluate the submission results of all teams using the updated toolkit.

As for comparisons, beyond DTF, several baseline systems incorporate frame-level deep networks and even utilize many other features: (1) Karaman et al. [17] used FV encoding of iDT with weighted saliency-based pooling, and conducted late fusion with frame-level CNN features. (2) Wang et al. [41] built a system on iDT with FV representation and frame-level CNN features, and performed post-processing to refine the detection results. (3) Oneata et al. [27] conducted localization using FV encoding of iDT on temporal sliding windows and performed post-processing following [25]; finally, they conducted a weighted fusion of the localization scores of temporal windows and video-level scores generated by classifiers trained on iDT features, image features, and audio features. The results are listed in Table 2. AP for each class can be found in Figure 4. Our Segment-CNN significantly outperforms the other systems for 14 of the 20 actions, and the average performance improves from 15.0% to 19.0%. We also show two prediction results for the THUMOS 2014 test set in Figure 6.

Table 2. Mean average precision (%) on THUMOS 2014 as the overlap IoU threshold θ used in evaluation varies.
    θ                     0.1    0.2    0.3    0.4    0.5
    Karaman et al. [17]   1.5    0.9    0.5    0.3    0.2
    Wang et al. [41]      19.2   17.8   14.6   12.1   8.5
    Oneata et al. [27]    39.8   36.2   28.8   21.8   15.0
    S-CNN                 47.7   43.5   36.3   28.7   19.0

Figure 4. Histogram of average precision (%) for each class on THUMOS 2014 when the overlap threshold is set to 0.5 during evaluation, comparing Karaman et al. [17], Wang et al. [41], Oneata et al. [27], and S-CNN.
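As a rough illustration of the evaluation protocol described above (correct class, IoU above the threshold θ, and no credit for redundant detections of the same instance), the following sketch marks the detections of one class in one video as true or false positives. It is our simplified reading of the protocol, not the official THUMOS toolkit; the greedy matching details are assumptions.

```python
def match_detections(detections, ground_truths, theta):
    """Mark detections of one class in one video as true/false positives.

    detections:    list of (start, end, score); processed in descending score.
    ground_truths: list of (start, end); each can be matched at most once,
                   so redundant detections of the same instance count as errors.
    Returns a list of booleans aligned with the score-sorted detections,
    from which precision/recall and AP can be accumulated.
    """
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    matched = [False] * len(ground_truths)
    flags = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        best, best_iou = None, 0.0
        for i, gt in enumerate(ground_truths):
            o = iou(det, gt)
            if o > best_iou:
                best, best_iou = i, o
        if best is not None and best_iou >= theta and not matched[best]:
            matched[best] = True
            flags.append(True)    # correct detection
        else:
            flags.append(False)   # low overlap or a redundant detection
    return flags
```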
Efficiency analysis. Our approach is very efficient compared with all the other systems, which typically fuse different features and therefore can become quite cumbersome. Most segments generated by the sliding windows are removed by the first proposal network, and thus the operations in classification and localization are greatly reduced. For each batch, the speed is around 1 second, and the number of segments that can be processed in each batch depends on the GPU memory (approximately 25 for a GeForce GTX 980 with 4GB of memory). The storage requirement is also extremely small because our method does not need to cache intermediate high-dimensional features, such as FV for training SVMs. All that Segment-CNN requires is three deep network models, which occupy less than 1GB in total.

4.3. Impact of individual networks

To study the effects of each network individually, we compare four Segment-CNNs using different settings: (1) S-CNN: keep all three networks and the settings in Section 4.2, with Θ_loc fine-tuned on Θ_cls; (2) S-CNN (w/o proposal): remove the proposal network completely and directly use Θ_loc to make predictions on sliding windows; (3) S-CNN (w/o classification): remove the classification network completely, so that there is no Θ_cls to serve as initialization for training Θ_loc; (4) S-CNN (w/o localization): remove the localization network completely and instead use the classification model Θ_cls to produce predictions.

The proposal network. We compare S-CNN (w/o proposal) and S-CNN, which includes the proposal network as described above (two nodes in fc8). Because of its smaller network architecture, S-CNN can reduce the number of operations conducted on background segments compared with S-CNN (w/o proposal), and therefore accelerate speed. In addition, the results listed in Table 3 demonstrate that keeping the proposal network can also improve precision, because it is designed to filter out background segments that lack actions of interest.

Table 3. mAP (%) comparison on THUMOS 2014 between removing and keeping the proposal network. The overlap threshold is set to 0.5 during evaluation.
    networks   S-CNN (w/o proposal)   S-CNN
    mAP (%)    17.1                   19.0

The classification network. Although Θ_cls is not used during prediction, the classification network is still important because fine-tuning on Θ_cls results in better performance. During evaluation here, we perform top-κ selection on the final prediction results to select the κ segments with maximum confidence scores. As shown in Figure 7, S-CNN fine-tuned on Θ_cls outperforms S-CNN (w/o classification) consistently as κ varies, and consequently the classification network is necessary during training.

Figure 7. Effects of the classification and localization networks. The y-axis is mAP (%) on THUMOS 2014, and the x-axis varies the depth κ in top-κ selection. The overlap threshold is set to 0.5 during evaluation.

The localization network. Figure 7 also proves the effectiveness of the localization network. By adding the localization network, S-CNN can significantly improve performance compared with the baseline S-CNN (w/o localization), which only contains the proposal and classification networks. This is because the new loss function introduced in the localization network refines the scores in favor of segments with higher overlap with the ground truths, and therefore higher temporal localization accuracy can be achieved.
Figure 5. Prediction results for two action instances from MEXaction2 when the overlap threshold is set to 0.5 during evaluation. For each ground truth instance, we show two prediction results: A has the highest confidence score among the predictions associated with this ground truth, and B is an incorrect prediction. BullChargeCape: A is correct, but B is incorrect because each ground truth only allows one detection. HorseRiding: A is correct, but B is incorrect because each ground truth only allows one detection. The numbers shown with # are frame IDs.

Figure 6. Prediction results for two action instances from the THUMOS 2014 test set when the overlap threshold is set to 0.5 during evaluation. For each ground truth instance, we show two prediction results: A has the highest confidence score among the predictions associated with this ground truth, and B is an incorrect prediction. CleanAndJerk: A is correct, but B is incorrect because its overlap IoU with the ground truth is less than the threshold 0.5. LongJump: A is correct, but B is incorrect because it has the wrong action category prediction - PoleVault.

In addition, we vary α in the overlap loss term L_overlap of the loss function to evaluate its sensitivity. We find that our approach has stable performance over a range of α values (e.g., from 0.25 to 1.0).

5. Conclusion

We propose an effective multi-stage framework called Segment-CNN to address temporal action localization in untrimmed long videos. Through the above evaluation of each network, we demonstrate the contribution of the proposal network in identifying candidate segments, the necessity of the classification network in providing good initialization for training the localization model, and the effectiveness of the new loss function used in the localization network in precisely localizing action instances in time.

In the future, we would like to extend our work to events and activities, which usually consist of multiple actions; precisely localizing action instances in time can therefore be helpful for their recognition and detection.

6. Acknowledgment

This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20071. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI-NBC, or the U.S. Government. We thank Dong Liu, Guangnan Ye, and the anonymous reviewers for their insightful suggestions.

References

[1] Mexaction2. http://mexculture.cnam.fr/xwiki/bin/view/Datasets/Mex+action+dataset, 2015.
6 objectnesswithconvolutionalnetworks.InICCV,2015.1,3 [2] J.K.AggarwalandM.S.Ryoo. Humanactivityanalysis:A [22] K.-T.Lai,D.Liu,M.-S.Chen,andS.-F.Chang. Recogniz- review. InACMComputingSurveys,2011. 1,2 ingcomplexeventsinvideosbylearningkeystatic-dynamic [3] G.Cheng,Y.Wan,A.N.Saudagar,K.Namuduri,andB.P. evidences. InECCV,2014. 2 Buckles. Advancesinhumanactionrecognition: Asurvey. [23] K.-T. Lai, F. X. Yu, M.-S. Chen, and S.-F. Chang. Video 2015. 1,2 event detection by inferring temporal instance labels. In [4] A.Gaidon,Z.Harchaoui,andC.Schmid. Actomsequence CVPR,2014. 2 modelsforefficientactiondetection. InCVPR,2011. 3 [24] Z.Lan, M.Lin, X.Li, A.G.Hauptmann, andB.Raj. Be- [5] A.Gaidon, Z.Harchaoui, andC.Schmid. Temporallocal- yondgaussianpyramid: Multi-skipfeaturestackingforac- izationofactionswithactoms. InTPAMI,2013. 3 tionrecognition. InCVPR,2015. 1,2 [6] R.Girshick. Fastr-cnn. InICCV,2015. 1,3 [25] D. Oneata, J. Verbeek, and C. Schmid. Action and event [7] R.Girshick,J.Donahue,T.Darrell,andJ.Malik. Richfea- recognitionwithfishervectorsonacompactfeatureset. In ture hierarchies for accurate object detection and semantic ICCV,2013. 1,2,6 segmentation. InCVPR,2014. 1,3 [26] D.Oneata, J.Verbeek, andC.Schmid. Efficientactionlo- calizationwithapproximatelynormalizedfishervectors. In [8] G.GkioxariandJ.Malik. Findingactiontubes. InCVPR, CVPR,2014. 3 2015. 3 [27] D.Oneata,J.Verbeek,andC.Schmid. Thelearsubmission [9] A. Gorban, H. Idrees, Y.-G. Jiang, A. R. Zamir, I. Laptev, atthumos2014. InECCVTHUMOSWorkshop,2014. 1,6 M.Shah,andR.Sukthankar. THUMOSchallenge: Action [28] R.Poppe. Asurveyonvision-basedhumanactionrecogni- recognitionwithalargenumberofclasses. http://www. tion. InImageandvisioncomputing,2010. 1,2 thumos.info/,2015. 1,2,3 [29] M.M.Puscas,E.Sangineto,D.Culibrk,andN.Sebe. Un- [10] M. Jain, J. van Gemert, H. Je´gou, P. Bouthemy, and supervised tube extraction using transductive learning and C. Snoek. Action localization with tubelets from motion. densetrajectories. InICCV,2015. 3 InCVPR,2014. 3 [30] S.Ren,K.He,R.Girshick,andJ.Sun.Fasterr-cnn:Towards [11] M. Jain, J. van Gemert, T. Mensink, and C. Snoek. Ob- real-timeobjectdetectionwithregionproposalnetworks. In jects2action: Classifyingandlocalizingactionswithoutany NIPS,2015. 1,3 videoexample. InICCV,2015. 3 [31] K.SimonyanandA.Zisserman. Two-streamconvolutional [12] M.Jain,J.vanGemert,andC.Snoek.Whatdo15,000object networksforactionrecognitioninvideos. InNIPS,2014. 1, categoriestellusaboutclassifyingandlocalizingactions? In 3 CVPR,2015. 3 [32] K. Simonyan and A. Zisserman. Very deep convolutional [13] S.Ji,W.Xu,M.Yang,andK.Yu. 3dconvolutionalneural networksforlarge-scaleimagerecognition. InInternational networksforhumanactionrecognition. InTPMAI,2013. 1, ConferenceonLearningRepresentations,2015. 3 3 [33] K.Soomro,H.Idrees,andM.Shah. Actionlocalizationin [14] Y.Jia,E.Shelhamer,J.Donahue,S.Karayev,J.Long,R.Gir- videosthroughcontextwalk. InICCV,2015. 3 shick,S.Guadarrama,andT.Darrell. Caffe: Convolutional [34] K.Soomro,A.R.Zamir,andM.Shah. UCF101: Adataset architectureforfastfeatureembedding. InACMMM,2014. of 101 human actions classes from videos in the wild. In 6 CRCV-TR-12-01,2012. 3 [15] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, [35] A.Stoian,M.Ferecatu,J.Benois-Pineau,andM.Crucianu. M. Shah, and R. Sukthankar. THUMOS challenge: Ac- Fast action localization in large scale video archives. In tion recognition with a large number of classes. http: TCSVT,2015. 3 //crcv.ucf.edu/THUMOS14/,2014. 1,3,6 [36] C. Sun, S. 
Shetty, R. Sukthankar, and R. Nevatia. Tempo- [16] Z.Jiang,Z.Lin,andL.S.Davis.Aunifiedtree-basedframe- rallocalizationoffine-grainedactionsinvideosbydomain workforjointactionlocalization,recognitionandsegmenta- transferfromwebimages. InACMMM,2015. 2 tion. InCVIU,2013. 3 [37] D.Tran,L.Bourdev,R.Fergus,L.Torresani,andM.Paluri. [17] S.Karaman, L.Seidenari, andA.D.Bimbo. Fastsaliency Learningspatiotemporalfeatureswith3dconvolutionalnet- basedpoolingoffisherencodeddensetrajectories. InECCV works. InICCV,2015. 1,3,4,6 THUMOSWorkshop,2014. 1,6 [38] J.vanGemert,M.Jain,E.Gati,andC.Snoek. Apt: Action [18] A.Karpathy,G.Toderici,S.Shetty,T.Leung,R.Sukthankar, localization proposals from dense trajectories. In BMVC, andF.-F.Li. Large-scalevideoclassificationwithconvolu- 2015. 3 tionalneuralnetworks. InCVPR,2014. 1,3 [39] H. Wang, A. Kla¨ser, C. Schmid, and C.-L. Liu. Action [19] A.Kla¨ser,M.Marszałek,C.Schmid,andA.Zisserman.Hu- RecognitionbyDenseTrajectories. InCVPR,2011. 1,2 manfocusedactionlocalizationinvideo.InTrendsandTop- [40] H.WangandC.Schmid. Actionrecognitionwithimproved icsinComputerVision,2012. 3 trajectories. InICCV,2013. 1,2 [41] L.Wang,Y.Qiao,andX.Tang. Actionrecognitionandde- tection by combining motion and appearance features. In ECCVTHUMOSWorkshop,2014. 1,6 [42] D.Weinland,R.Ronfard,andE.Boyer. Asurveyofvision- based methods for action representation, segmentation and recognition. InComputerVisionandImageUnderstanding, 2011. 1,2 [43] P.Weinzaepfel, Z.Harchaoui, andC.Schmid. Learningto trackforspatio-temporalactionlocalization.InICCV,2015. 3 [44] Z.Xu,Y.Yang,andA.G.Hauptmann.Adiscriminativecnn videorepresentationforeventdetection. InCVPR,2015. 1, 3 [45] G.YuandJ.Yuan. Fastactionproposalsforhumanaction detectionandsearch. InCVPR,2015. 3 [46] J. Yuan, Y. Pei, B. Ni, P. Moulin, and A. Kassim. Adsc submissionatthumoschallenge2015. InCVPRTHUMOS Workshop,2015. 1