1 Reconstruction-free action inference from compressive imagers Kuldeep Kulkarni, Pavan Turaga Abstract—Persistentsurveillancefromcameranetworks,suchasatparkinglots,UAVs,etc.,oftenresultsinlargeamountsof videodata,resultinginsignificantchallengesforinferenceintermsofstorage,communicationandcomputation.Compressive cameras have emerged as a potential solution to deal with the data deluge issues in such applications. However, inference taskssuchasactionrecognitionrequirehighqualityfeatureswhichimpliesreconstructingtheoriginalvideodata.Muchworkin 5 compressive sensing (CS) theory is geared towards solving the reconstruction problem, where state-of-the-art methods are 1 computationally intensive and provide low-quality results at high compression rates. Thus, reconstruction-free methods for 0 inference are much desired. In this paper, we propose reconstruction-free methods for action recognition from compressive 2 camerasathighcompressionratiosof100andabove.RecognizingactionsdirectlyfromCSmeasurementsrequiresfeatures which are mostly nonlinear and thus not easily applicable. This leads us to search for such properties that are preserved in n compressivemeasurements.Tothisend,weproposetheuseofspatio-temporalsmashedfilters,whicharecompressivedomain a versionsofpixel-domainmatchedfilters.Weconductexperimentsonpubliclyavailabledatabasesandshowthatonecanobtain J recognitionratesthatarecomparabletotheoraclemethodinuncompressedsetup,evenforhighcompressionratios. 8 1 IndexTerms—CompressiveSensing,Reconstruction-free,Actionrecognition ] (cid:70) V C . s c [1 INTRODUCTION Recent advances in the areas of compressive sens- ing (CS) [1] have led to the development of new 1 Action recognition is one of the long standing sensors like compressive cameras (also called single- vresearch areas in computer vision with widespread pixel cameras (SPCs)) [2] greatly reduce the amount 7 applications in video surveillance, unmanned aerial 6 of sensed data, yet preserve most of its information. vehicles (UAVs), and real-time monitoring of pa- 3 More recently, InView Technology Corporation ap- 4tients. All these applications are heavily resource- plied CS theory to build commercially available CS 0constrained and require low communication over- workstations and SWIR (Short Wave Infrared) cam- .heads in order to achieve real-time implementation. 1 eras, thus equipping CS researchers with a hitherto 0Consider the application of UAVs which provide unavailable armoury to conduct experiments on real 5real-time video and high resolution aerial images on CS imagery. In this paper, we wish to investigate 1demand. In these scenarios, it is typical to collect : the utility of compressive cameras for action recog- van enormous amount of data, followed by trans- nition in improving the tradeoffs between reliability imission of the same to a ground station using a X of recognition and computational/storage load of the low-bandwidth communication link. This results in r system in a resource constrained setting. CS theory aexpensivemethodsbeingemployedforvideocapture, states that if a signal can be represented by very few compression, and transmission implemented on the numberofcoefficientsinabasis,calledthesparsifying aircraft. The transmitted video is decompressed at a basis, then the signal can be reconstructed nearly central station and then fed into a action recogni- perfectly even in the presence of noise, by sensing tion pipeline. Similarly, a video surveillance system sub-Nyquist number of samples [1]. SPCs differ from which typically employs many high-definition cam- conventional cameras in that they integrate the pro- eras, gives rise to a prohibitively large amount of cess of acquisition and compression by acquiring a data, making it very challenging to store, transmit small number of linear projections of the original and extract meaningful information. Thus, there is a images. More formally, when a sequence of images is growing need to acquire as little data as possible and acquiredbyacompressivecamera,themeasurements yet be able to perform high-level inference tasks like are generated by a sensing strategy which maps the action recognition reliably. space of P Q images, I RPQ to an observation space Z R×K, ∈ ∈ • K. Kulkarni and P. Turaga are with the School of Arts, Media and Engineering and School of Electrical, Computer and Energy Z(t)=φI(t)+w(t), (1) Engineering, Arizona State University. Email: [email protected], [email protected]. where φ is a K PQ measurement matrix, w(t) is the × noise, and K PQ. The process is pictorially shown (cid:28) 2 ratios. 1.1 Relatedwork a)ActionRecognition: Theapproachesinhuman action recognition from cameras can be categorized basedonthelowlevelfeatures.Mostsuccessfulrepre- Aframe Random ofthe sentations of human action are based on features like Pattern scene φM opticalflow,pointtrajectories,backgroundsubtracted {{ blobs and shape, filter responses, etc. Mori et al.[4] Aframeattimeinstanttis Z (t)= , and Cheung et al.[5] used geometric model based corcroemlaptirnegssiitvweliythseKnserdanbdyom 1 hφ1 Scenei and shape based representations to recognize actions. patternsoftheDMDarray, Z (t)= , Bobick and Davis [6] represented actions using 2D [φ1K,φ2m,.e.a,φsuKr]emtoeonbtstain 2 hφ2 ... i motion energy and motion history images from a [Z1(t),Z2(t),..,ZK(t)]. Z (t)= , sequence of human silhouettes. Laptev [7] extracted Thesequence{Z(t)} K hφ i local interest points from a 3-dimensional spatiotem- istheCSvideo. K poral volume, leading to a concise representation Fig.1. CompressiveSensing(CS)ofascene:Everyframe of a video. Wang et al.[8] evaluated various combi- ofthesceneiscompressivelysensedbyopticallycorrelating nations of space-time interest point detectors (Har- randompatternswiththeframetoobtainCSmeasurements. ris3D, Cuboid, Hessian) and several descriptors to ThetemporalsequenceofsuchCSmeasurementsistheCS perform action recognition. The current state-of-the- video. artapproaches[9],[10]toactionrecognitionarebased ondensetrajectories,whichareextractedusingdense in Figure 1. optical flow. The dense trajectories are encoded by Difference between CS and video codecs: It complex, hand-crafted descriptors like histogram of is worth noting at this point that the manner in orientedgradients(HOG)[11],histogramoforiented which compression is achieved by SPCs differs fun- optical flow (HOOF) [12], HOG3D [13], and motion damentally from the manner in which compression boundaryhistograms(MBH)[9].Foradetailedsurvey is achieved in JPEG images or MPEG videos. In the of action recognition, the readers are referred to [14]. case of JPEG, the images are fully sensed and then However,theextractionoftheabovefeaturesinvolves compressed by applying wavelet transform or DCT various non-linear operations. This makes it very to the sensed data, and in the case of MPEG, a video difficult to extract such features from compressively after having been sensed fully is compressed using sensed images. a motion compensation technique. However, in the b) Action recognition in compressed domain: case of SPCs, at the outset one does not have direct Though action recognition has a long history in com- access to full blown images, I(t) . SPCs instead puter vision, little exists in literature to recognize { } provide us with compressed measurements Z(t) actions in the compressed domain. Yeo et al.[15] { } directly by optically calculating inner products of the and Ozer et al.[16] explore compressed domain ac- images, I(t) , with a set of test functions given by tion recognition from MPEG videos by exploiting { } the rows of the measurement matrix, φ, implemented the structure, induced by the motion compensation using a programmable micro-mirror array [2]. While technique used for compression. MPEG compression this helps avoid the storage of a large amount of is done on a block-level which preserves some lo- data and expensive computations for compression, cal properties in the original measurements. How- it often comes at the expense of employing high ever, as stated above, the compression in CS cameras computational load at the central station to recon- is achieved by randomly projecting the individual struct the video data perfectly. Moreover, for perfect frames of the video onto a much lower dimensional reconstruction of the images, given a sparsity level of space and hence does not easily allow leveraging s, state-of-the-art algorithms require O(slog(PQ/s)) motioninformationofthevideo.CSimageryacquires measurements [1], which still amounts to a large global measurements, thereby do not preserve any fraction of the original data dimensionality. Hence, local information in their raw form, making action using SPCs may not always provide advantage with recognition much more difficult in comparison. respecttocommunicationresourcessincecompressive c) Inference problems from CS video: Sankara- measurements and transform coding of data require narayanan et al.[17] attempted to model videos as a comparable bandwidth [3]. However, we show that LDS(LinearDynamicalSystem)byrecoveringparam- it is indeed possible to perform action recognition at eters directly from compressed measurements, but is much higher compression ratios, by bypassing recon- sensitive to spatial and view transforms, making it struction. In order to do this, we propose a spatio- moresuitableforrecognitionofdynamictexturesthan temporal smashed filtering approach, which results action recognition. Thirumalai et al.[18] introduced a in robust performance at extremely high compression reconstruction-free framework to obtain optical flow 3 3DMACHFilter First,thetrainingexamplesareaffine transformedtoacanonicalviewpoint. Next,foreachactionclassasingle Training phase ... composite3Dtemplatecalled‘Ac- tionMACH’filter,whichcaptures intra-classvariabilityissynthesized. Trainingexamples SynthesizeActionMACHfilter CorfKamrnapdmmroeeewmsas(isstiSufvhiuzereoenenucmcPtastieimsQonnenen)tgrsssoait[[fnshφZgt1oe1h,p(tetφthte)i2etc,se,aZtfs.ult.2lv,ly(vφlitidKc)fdr,oee]a.roo.mr,toeZwelaKo.ittbhe(tta)eKi]an,ch FFFK12(((ttt)))===hhh ,,, ... iii}scEuofTcaurhhrcvneehecvlcttaef‘iterctoSomteatnorModmpsr[o,FwAosebar1iSftosato(aHfhtluri)anmsEst,ehe3FDaqdetD2uhKFsb(eeaMtyI-n)mcLl,ceCAoT.eenr.CSE,gr{KFHetcRFhsaKrp’(fim.at(ol)tnten})erdd]ra.ooisnfimsg { Z1(t)= , Thecompressivelysensed hφ i testvideoiscorrelatedwith 1 Scene smashedfiltersforallaction classestoobtainrespec- Z2(t)=hφ2 , ... i tivecorrelationvolumes. Z (t)= , Spatio-temporalSmashedFiltering 3DCorrelationvolume K hφ i K CSmeasurementsoftestvideo Testing phase Fig. 2. Overviewofourapproachtoactionrecognitionfromacompressivelysensedtestvideo.First,MACH[20]filtersfor differentactionsaresynthesizedofflinefromtrainingexamplesandthencompressedtoobtainsmashedfilters.Next,theCS measurementsofthetestvideoarecorrelatedwiththesesmashedfilterstoobtaincorrelationvolumeswhichareanalyzed todeterminetheactioninthetestvideo. based on correlation estimation between two com- ofthetestexamplesarecorrelatedwiththesesmashed pressively sensed images. However, the method does filters. Concisely, smashed filtering hinges on the fact notworkwellatverylowmeasurementrates.Calder- that correlation between a reference signal and an bank et al.[19] theoretically showed that ‘learning input signal is nearly preserved even when they are directly in compressed domain is possible’, and that projected onto a much lower-dimensional space. In with high probability the linear kernel SVM classi- this paper, we show that spatio-temporal smashed fier in the compressed domain can be as accurate filters provide a natural solution to reconstruction- as best linear threshold classifier in the data do- free action recognition from compressive cameras. main. Recently, Kulkarni and Turaga [21] proposed a Our framework (shown in Figure 2) for classification novel method based on recurrence textures for action includes synthesizing Action MACH (Maximum Av- recognition from compressive cameras. However, the erage Correlation Height) filters [20] offline and then method is prone to produce very similar recurrence correlatingthecompressedversionsofthefilterswith textures even for dissimilar actions for CS sequences compressed measurements of the test video, instead and is more suited for feature sequences as in [22]. of correlating raw filters with full-blown video, as is the case in [20]. Action MACH involves synthesizing d) Correlation filters in computer vision: Even asingle3Dspatiotemporalfilterwhichcapturesinfor- though, as stated above, the approaches based on mation about a specific action from a set of training dense trajectories extracted using optical flow in- examples. MACH filters can become ineffective if formation have yielded state-of-the-art results, it is there are viewpoint variations in the training exam- difficult to extend such approaches while dealing ples. To effectively deal with this problem, we also with compressed measurements. Earlier approaches propose a quasi view-invariant solution, which can to action recognition were based on correlation fil- be used even in uncompressed setup. ters, which were obtained directly from pixel data [23], [24], [25], [20], [26], [27]. The filters for different Contributions: 1) We propose a correlation-based actions are correlated with the test video and the framework for action recognition and localization di- responses thus obtained are analyzed to recognize rectly from compressed measurements, thus avoid- and locate the action in the test video. Davenport ing the costly reconstruction process. 2) We provide et al.[28] proposed a CS counterpart of the correla- principledwaystoachievequasiview-invarianceina tion filter based framework for target classification. spatio-temporal smashed filtering based action recog- Here, the trained filters are compressed first to obtain nition setup. 3) We further show that a single MACH ‘smashed filters’, then the compressed measurements filter for a canonical view is sufficient to generate 4 MACH filters for all affine transformed views of the Next,zero-paddingeachframeinH uptoasizeP Q i × canonical view. and changing the indices, (2) can be rewritten as: Outline: Section 2 outlines the reconstruction- N−1Q−1P−1 (cid:88) (cid:88) (cid:88) free framework for action recognition, using spatio- c (l,m,n)= s(α,β,n+t)H (α l,β m,t). i i − − temporal smashed filters (STSF). In section 3, we t=0 β=0α=0 describe a quasi view-invariant solution to MACH (3) basedactionrecognitionbyoutliningasimplemethod This can be written as the summation of N correla- to generate MACH filters for any affine transformed tions in the spatial domain as follows: view.Insection4,wepresentexperimentalresultsob- N−1 (cid:88) tained on three popular action databases, Weizmann, c (l,m,n)= S ,Hl,m,t , (4) i (cid:104) n+t i (cid:105) UCF sports, UCF50 and HMDB51 databases. t=0 where, , denotesthedotproduct,S isthecolumn n+t (cid:104) (cid:105) vector obtained by concatenating the Q columns of 2 RECONSTRUCTION FREE ACTION RECOG- the (n+t)th frame of the test video. To obtain Hl,m,t, i NITION we first shift the tth frame of the zeropadded filter volumeH byl andmunitsinxandy respectivelyto i To devise a reconstruction-free method for action obtain an intermediate frame and then rearrange it to recognition from compressive cameras, we need to a column vector by concatenating its Q columns. Due exploit such properties that are preserved robustly to the distance preserving property of measurement eveninthecompresseddomain.Onesuchpropertyis matrix φ, the correlations are nearly preserved in the the distance preserving property of the measurement muchlowerdimensionalcompresseddomain.Tostate matrixφusedforcompressivesensing[1],[29].Stated the property more specifically, using JL Lemma [29], differently, the correlation between any two signals is the following relation can be shown: nearly preserved even when the data is compressed N−1 (cid:88) toamuchlowerdimensionalspace.Thismakescorre- c (l,m,n) N(cid:15) φS ,φHl,m,t c (l,m,n)+N(cid:15). i − ≤ (cid:104) n+t i (cid:105)≤ i lation filters a natural choice to adopt. 2D correlation t=0 filters have been widely used in the areas of auto- (5) matic target recognition and biometric applications Thederivationofthisrelationandthepreciseformof like face recognition [30], palm print identification (cid:15)isasfollows.Inthefollowing,wederivetherelation [31], etc., due to their ability to capture intraclass between the response volume from uncompressed variabilities. Recently, Rodriguez et al.[20] extended dataandresponsevolumeobtainedusingcompressed thisconceptto3Dbyusingaclassofcorrelationfilters data. According to JL Lemma [29], given 0 < (cid:15) < 1 , called MACH filters to recognize actions. As stated a set of 2N points in RPQ, each with unit norm, S earlier, Davenport et al.[28] introduced the concept and K > (log(N)), there exists a Lipschitz function of smashed filters by implementing matched filters f :RPQ ORK (cid:15)s2uch that → in the compressed domain. In the following section, (1−(cid:15))(cid:107)Sn+t−Hil,m,t(cid:107)2 ≤(cid:107)f(Sn+t)−f(Hil,m,t)(cid:107)2 we generalize this concept of smashed filtering to (1+(cid:15)) S Hl,m,t 2, the space-time domain and show how 3D correlation ≤ (cid:107) n+t− i (cid:107) (6) filterscanbeimplementedinthecompresseddomain for action recognition. and (1 (cid:15)) S +Hl,m,t 2 f(S )+f(Hl,m,t) 2 − (cid:107) n+t i (cid:107) ≤(cid:107) n+t i (cid:107) 2.1 Spatio-temporalsmashedfiltering(STSF) (1+(cid:15)) S +Hl,m,t 2, ≤ (cid:107) n+t i (cid:107) Thissectionformsthecoreofouractionrecognition (7) pipeline, wherein we outline a general method to im- S and Hl,m,t . Now we have: plementspatio-temporalcorrelationfiltersusingcom- ∀ n+t i ∈S pressed measurements without reconstruction and 4(cid:104)f(Sn+t),f(Hil,m,t)(cid:105) subsequently, recognize actions using the response = f(S )+f(Hl,m,t) 2 f(S ) f(Hl,m,t) 2 (cid:107) n+t i (cid:107) −(cid:107) n+t − i (cid:107) volumes. To this end, consider a given video s(x,y,t) (1+(cid:15)) S +Hl,m,t 2 (1+(cid:15)) S Hl,m,t 2 of size P Q R and let H (x,y,t) be the optimal ≥ (cid:107) n+t i (cid:107) − (cid:107) n+t− i (cid:107) × × i =4 S ,Hl,m,t 2(cid:15)( S 2+ Hl,m,t 2) 3D matched filter for actions i = 1,..,NA, with size (cid:104) n+t i (cid:105)− (cid:107) n+t(cid:107) (cid:107) i (cid:107) L M N andN isthenumberofactions.First,the 4 S ,Hl,m,t 4(cid:15). × × A ≥ (cid:104) n+t i (cid:105)− test video is correlated with the matched filters of all (8) actions i = 1,..N to obtain respective 3D response A volumes as in (2). We can get a similar relation for opposite direction, N−1M−1L−1 which when combined with (8), yields the following: (cid:88) (cid:88) (cid:88) c (l,m,n)= s(l+x,m+y,n+t)H (x,y,t). i i S ,Hl,m,t (cid:15) f(S ),f(Hl,m,t) t=0 y=0 x=0 (cid:104) n+t i (cid:105)− ≤(cid:104) n+t i (cid:105) (2) S ,Hl,m,t +(cid:15). (9) ≤(cid:104) n+t i (cid:105) 5 However, JL Lemma does not provide us with a 2.2 Trainingfiltersforactionrecognition embedding, f which satisfies the above relation. As The theory of training correlation filters for any discussed in [32], f can be constructed as a matrix, φ recognition task is based on synthesizing a single with size K PQ, whose entries are either template from training examples, by finding an opti- × • independent realizations of Gaussian random mal tradeoff between certain performance measures. variables or Based on the performance measures, there exist a • independent realizations of Bernoulli random numberofclassesofcorrelationfilters.AMACHfilter ± variables. is a single filter that encapsulates the information Now, if φ constructed as explained above is used as of all training examples belonging to a particular measurement matrix, then we can replace f in (9) by class and is obtained by optimizing four performance φ, leading us to parameters, the Average Correlation Height (ACH), S ,Hl,m,t (cid:15) φS ,φHl,m,t the Average Correlation Energy (ACE), the Average (cid:104) n+t i (cid:105)− ≤(cid:104) n+t i (cid:105) Similarity Measure (ASM), and the Output Noise S ,Hl,m,t +(cid:15). (10) ≤(cid:104) n+t i (cid:105) Variance (ONV). Until recently, this was used only in Hence, we have, two dimensional applications like palm print identifi- N(cid:88)−1 N(cid:88)−1 cation[31],targetrecognition[33]andfacerecognition S ,Hl,m,t N(cid:15) φS ,φHl,m,t (cid:104) n+t i (cid:105)− ≤ (cid:104) n+t i (cid:105) problems[30].Foractionrecognition,Rodriguezetal. t=0 t=0 [20]introducedageneralizedformofMACHfiltersto N(cid:88)−1 synthesize a single action template from the spatio- S ,Hl,m,t +N(cid:15). (11) ≤ (cid:104) n+t i (cid:105) temporal volumes of the training examples. Further- t=0 more, they extended the notion for vector-valued Using equations (4) and (11), we arrive at the follow- data. In our framework for compressive action recog- ing desired equation. nition,weadoptthisapproachtotrainmatchedfilters N−1 for each action. Here, we briefly give an overview of (cid:88) c (l,m,n) N(cid:15) φS ,φHl,m,t c (l,m,n)+N(cid:15). 3D MACH filters which was first described in [20]. i − ≤ (cid:104) n+t i (cid:105)≤ i t=0 First, temporal derivatives of each pixel in the (12) spatio-temporalvolumeofeachtrainingsequenceare Nowallowingfortheerrorincorrelation,wecancom- computed and the frequency domain representation pute the response from compressed measurements as of each volume is obtained by computing a 3D-DFT below: of that volume, according to the following: N−1 (cid:88) N−1M−1 L−1 cciomp(l,m,n)= (cid:104)φSn+t,φHil,m,t(cid:105). (13) F(u)= (cid:88) (cid:88) (cid:88) f(x)e(−j2π(u·x)), (14) t=0 The above relation provides us with the 3D re- t=0 x2=0x1=0 sponse volume for the test video with respect to a where, f(x) is the spatio-temporal volume of L rows, particularaction,withoutreconstructingtheframesof M columnsandN frames,F(u)isitsspatio-temporal the test video. To reduce computational complexity, representation in the frequency domain and x = the 3D response volume is calculated in frequency (x1,x2,t) and u = (u1,u2,u3) denote the indices in domain via 3D FFT. space-time and frequency domain respectively. If Ne Feature vector and Classification using SVM: is the number of training examples for a particular For a given test video, we obtain NA correlation action, then we denote their 3D DFTs by Xi(u),i = volumes.Foreachcorrelationvolume,weadaptthree 1,2,..,Ne, each of dimension, d = L M N. × × level volumetric max-pooling to obtain a 73 dimen- The average spatio-temporal volume of the training sional feature vector [27]. In addition, we also com- set in the frequency domain is given by Mx(u) = pute peak-to-side-lobe-ratio for each of these 73 max- 1 (cid:80)Ne X (u).Theaveragepowerspectraldensityof p,wohoelerde pveaalukkes.isPtShRe iksthgimveanx-pboyolPedSRvkalu=e, paenakdσii−µµki atNhneedttrhai=ien1ainvgeirsaegteissigmivileanribtyymDaxt(ruix)=ofNt1hee(cid:80)trNia=ien1i|nXgi(sue)t|2is, ainnditsσksmaarelltnheeigmhbeaonurahnodods.taTnhduasr,dthdeevfeiaattiuorne vvaelcutoesr gMivAeCnHbyfiSltxe(ruf)or=thN1aet(cid:80)acNit=ieo1n|Xisi(cuo)m−pMuxte(du)b|2y.mNoinwim,tihze- for a given test video is of dimension, NA 146. ing the average correlation energy, average similarity × This framework can be used in any reconstruction- measure, output noise variance and maximizing the free application from compressive cameras which can averagecorrelationheight.Thisisdonebycomputing be implemented using 3D correlation filtering. Here, the following: we assume that there exists an optimal matched filter 1 foreachactionandoutlineawaytorecognizeactions h(u)= M (u), (15) [αC(u)+βD (u)+γS (u)] x from compressive measurements. In the next section, x x we show how these optimal filters are obtained for where,C(u)isthenoisevarianceatthecorresponding each action. frequency. Generally, it is set to be equal to 1 at 6 allfrequencies.Thecorrespondingspace-timedomain a common viewpoint and avoid the merging effect. representation H(x,y,t) is obtained by taking the However, different test videos may be in different inverse 3D DFT of h. A filter with response volume viewpoints, which makes it impractical to synthesize H and parameters α, β and γ is compactly written as filtersforeveryviewpoint.Henceitisdesirablethata H= H,α,β,γ . single representative filter be generated for all affine { } transforms of a canonical view. The following propo- 3 AFFINE INVARIANT SMASHED FILTERING sitionassertsthat,fromaMACHfilterdefinedforthe canonical view, it is possible to obtain a compensated Even though MACH filters capture intra-class vari- MACH filter for any affine transformed view. ations, the filters can become ineffective if viewpoints Proposition 1: Let H = H,α,β,γ denote the of the training examples are different or if the view- { } MACH filter in the canonical view, then for any point of the test video is different from viewpoints arbitrary view V, related to the canonical view of the training examples. Filters thus obtained may by an affine transformation, [Ab], there exists a result in misleading correlation peaks. Consider the MACH filter, Hˆ = Hˆ,αˆ,βˆ,γˆ su|ch that: Hˆ(x ,t) = case of generating a filter of a translational action, ∆2H(Ax +b,t), αˆ{= ∆2α,}βˆ=β and γˆ =γ swhere walking,whereinthetrainingsetissampledfromtwo | | s | | x = (x ,x ) denote the horizontal and vertical axis different views. The top row in Fig 3 depicts some s 1 2 indices and ∆ is the determinant of A. frames of the filter, say ‘Type-1’ filter, generated out Proof: Consider the frequency domain response hˆ of such a training set. The bottom row depicts some for view V, given by the following. frames of the filter, say ‘Type-2’ filter, generated by affine transforming all examples in the training set to hˆ(u)= 1 Mˆ (u). (16) a canonical viewpoint. Roughly speaking, the ‘Type- (αCˆ(u)+βDˆ (u)+γSˆ (u)) x x x Forthesakeofconvenience,weletu=(u ,u )where s 3 Filter without flipping u = (u ,u ) denotes the spatial frequencies and u , s 1 2 3 the temporal frequency. Now using properties of the Fourier transform [34], we have, (a) 1 (cid:88)Ne Filter with flipping Mˆx(us,u3)= Xˆi(us,u3) N e i=1 1 (cid:88)Ne ej2πb·(A−1)TusXi((A−1)Tus,u3) = . N ∆ (b) e i=1 | | Fig.3. a)‘Type-1’filterobtainedforwalkingactionwherethe Using the relation M (u)= 1 (cid:80)Ne X (u), we get, training examples were from different viewpoints b) ‘Type-2’ x Ne i=1 i filter obtained from the training examples by bringing all the torfaihnuinmgaenxmamovpeleisntooptphoesisteamdeirevciteiownpsoiannt.dInev(ean),tutwalolygmroeurgpes Mˆx(us,u3)= ej2πb·(A−1)TusMx((A−1)Tus,u3). (17) ∆ into each other, thus making the filter ineffective. In (b), the | | merging effect is countered by transforming the training set Now, tothesameviewpoint. 2’ filter can be interpreted as many humans walking 1 (cid:88)Ne Dˆ (u ,u )= Xˆ (u ,u )2 in the same direction, whereas the ‘Type-1’ filter, as x s 3 N | i s 3 | e i=1 2 groups of humans, walking in opposite directions. Onecannoticethatsomeoftheframesinthe‘Type-1’ = 1 (cid:88)Ne ej2πb·(A−1)TusXi((A−1)Tus,u3) 2 donotrepresenttheactionofinterest,particularlythe Ne | ∆ | i=1 | | ones in which the two groups merge into each other. This kind of merging effect will become more promi- = 1 (cid:88)Ne Xi((A−1)Tus,u3) 2. nent as the number of different views in the training Ne | ∆ | i=1 | | set increases. The problem is avoided in the ‘Type-2’ Hence, using the relation D (u) = x filter because of the single direction of movement of 1 (cid:80)Ne X (u)2, we have the whole group. Thus, it can be said that the quality Ne i=1| i | of information about the action in the ‘Type-2’ filter 1 Dˆ (u ,u )= D ((A−1)Tu ,u ). (18) is better than that in the ‘Type-1’ filter. As we show x s 3 ∆2 x s 3 | | inexperiments,thisisindeedthecase.Assumingthat Similarly, it can be shown that allviewsofalltrainingexamplesareaffinetransforms of a canonical view, we can synthesize a MACH filter 1 Sˆ (u ,u )= S ((A−1)Tu ,u ). (19) generated after transforming all training examples to x s 3 ∆2 x s 3 | | 7 Using (17), (18) and (19) in (16), we have, Variation of mean PSR as viewing angle changes 6.4 hˆ(u)=(ej2πb·(A−1)TusM ((A−1)Tu ,u ))∆ 6.3 x s 3 1 6.2 . (20) (αˆ|∆|2Cˆ(u)+βˆD ((A−1)Tu ,u )+γˆS ((A−1)Tu ,u ) e6.1 Now letting, αx = αˆ ∆2s, β3 = βˆ,xγ = γˆ, sCˆ(u3) = R valu 6 CCaonmopneicnasla ftieltde rFsilters C(u)=C((A−1)Tus,u3|))|(since C is usually assumed an PS55..89 tobeequalto1atallfrequenciesifnoisemodelisnot e M5.7 available) and using (15), we have, 5.6 5.5 hˆ(u)=∆h((A−1)Tus,u3))ej2πb·(A−1)Tus. (21) 5.40 5 10 15 20 Viewing angle in degrees Now taking the inverse 3D-FFT of hˆ(u), we have, Fig. 4. The mean PSRs for different viewpoints for both canonical filters and compensated filters are shown. The Hˆ(x ,t)= ∆2H(Ax +b,t). (22) s | | s mean PSR values obtained using compensated filters are more than those obtained using canonical filters, thus cor- Thus, a compensated MACH filter for the view V is roboratingtheneedofaffinetransformingtheMACHfiltersto givenbyHˆ = Hˆ,αˆ,βˆ,γˆ .Thiscompletestheproofof theviewpointofthetestexample. { } theproposition.ThusaMACHfilterforviewV,with parameters ∆2α, β and γ can be obtained just by | | affinetransformingtheframesoftheMACHfilterfor 4 EXPERIMENTAL RESULTS the canonical view. Normally ∆ 1 for small view | | ≈ changes.Thus,eventhoughintheory,αˆ isrelatedtoα For all our experiments, we use a measurement byascalingfactorof ∆2,forsmallviewchanges,hˆ is matrix,φwhoseentriesaredrawnfromi.i.d.standard | | theoptimalfilterwithessentiallythesameparameters Gaussian distribution, to compress the frames of the asthoseforthecanonicalview.Thisresultshowsthat test videos. We conducted extensive experiments on for small view changes, it is possible to build robust the widely used Weizmann [35], UCF sports [20], MACH filters from a single canonical MACH filter. UCF50 [37] and HMDB51 [38] datasets to validate Robustness of affine invariant smashed filter- the feasibility of action recognition from compressive ing: To corroborate the need of affine transforming cameras. Before we present the action recognition theMACHfilterstotheviewpointofthetestexample, results, we briefly discuss the baseline methods to we conduct the following two synthetic experiments. whichwecompareourmethod,anddescribeasimple Inthefirst,wetookallexamplesinWeizmanndataset toperformactionlocalizationinthosevideosinwhich and assumed that they belong to the same view, the action is recognized successfully. dubbed as the canonical view. We generated five Baselines: As noted earlier, this is the first paper different datasets, each corresponding to a different to tackle the problem of action recognition from com- viewing angle. The different viewing angles from 0◦ pressive cameras. The absence of precedent approach to 20◦ in increments of 5◦ were simulated by means to this problem makes it difficult to decide on the of homography. For each of these five datasets, a baseline methods to compare with. The state-of-the- recognition experiment is conducted using filters for art methods for action recognition from traditional the canonical view as well as the compensated filters cameras rely on dense trajectories [10], derived using for their respective viewpoints, obtained using (22). highlynon-linearfeatures,HOG[11],HOOF[12],and The average PSR in both cases for each viewpoint is MBH [9]. At the moment, it is not quite clear on showninFigure4.ThemeanPSRvaluesobtainedus- howtoextractsuchfeaturesdirectlyfromcompressed ing compensated filters are more than those obtained measurements. Due to these difficulties, we fixate on using canonical filters. two baselines. The first baseline method is the Oracle In the second experiment, we conducted five in- MACH,whereinactionrecognitionisperformedasin dependent recognition experiments for the dataset [20] and for the second baseline, we first reconstruct corresponding to fixed viewing angle of 15◦, us- theframesfromthecompressivemeasurementsusing ing compensated filters generated for five different CoSaMPalgorithm[39],andthenapplytheimproved viewing angles. The results are tabulated in table 1. dense trajectories (IDT) method [10], which is the It is evident that action recognition rate is highest most stable state-of-the-art method, on the recon- when the compensated filters used correspond to the structed video to perform action recognition. We use viewing angle of the test videos. These two synthetic the code made publicly available by the authors, and experimentsclearlysuggestthatitisessentialtoaffine set all the parameters to default to obtain improved transformthefilterstotheviewpointofthetestvideo dense trajectory features. The features thus obtained before performing action recognition. are encoded using Fisher vectors, and a linear SVM 8 Viewing angle Canonical 5◦ 10◦ 15◦ 20◦ Recognition rate 65.56 68.88 67.77 72.22 66.67 TABLE1 Actionrecognitionratesforthedatasetcorrespondingtofixedviewingangleof15◦usingcompensatedfiltersgeneratedfor variousviewingangles.Asexpected,actionrecognitionrateishighestwhenthecompensatedfiltersusedcorrespondtothe viewingangleofthetestvideos. is used for classification. Henceforth, we refer this not straightforward. The Weizmann dataset contains method as Recon+IDT. 10 different actions, each performed by 9 subjects, Spatial Localization of action from compressive thus making a total of 90 videos. For evaluation, we cameras without reconstruction: Action localization used the leave-one-out approach, where the filters in each frame is determined by a bounding box were trained using actions performed by 8 actors and centred at location (lmax) in that frame, where lmax tested on the remaining one. The results shown in is determined by the peak response (response corre- table 2 indicate that our method clearly outperforms sponding to the classified action) in that frame and theRecon+IDT.Itisquiteevidentthatwithfull-blown the size of the filter corresponding to the classified frames (indicated in table 2) that Recon+IDT method action. To determine the size of the bounding box performs much better than STSF method. However, for a particular frame, the response values inside a at compression ratios of 100 and above, recognition large rectangle of the size of the filter, and centred rates are very stable for our STSF framework, while at lmax in that frame are normalized so that they Recon+IDT fails completely. This is due to the fact sum up to unity. Treating this normalized rectangle that Recon+IDT operates on reconstructed frames, as a 2D probability density function, we determine which are of poor quality at such high compression the bounding box to be the largest rectangle centred ratios, while STSF operates directly on compressed at lmax, whose sum is less than a value, λ 1. For measurements. The recognition rates are stable even ≤ our experiments, we use λ equal to 0.7. at high compression ratios and are comparable to Computationalcomplexity: Inordertoshowthe the recognition accuracy for the Oracle MACH (OM) substantial computational savings achievable in our method [40]. STSF framework of reconstruction-free action recog- nition from compressive cameras, we compare the Compressionfactor STSF Recon+IDT computational time of the framework with that of 1 81.11(3.22s)(OM[40],[20]) 100(3.1s) 100 81.11(3.22s) 5.56(1520s) Recon+IDT. All experiments are conducted on a Intel 200 81.11(3.07s) 10(1700s) i7 quad core machine with 16GB RAM. 300 76.66(3.1s) 10(1800s) Compensated Filters: In section 3, we experi- 500 78.89(3.08s) 7.77(2000s) mentallyshowedthatbetteractionrecognitionresults TABLE2 canbeobtainedifcompensatedfiltersareusedinstead Weizmanndataset:Recognitionratesforreconstruction-free of canonical view filters (table 1). However, to gener- recognitionfromcompressivecamerasfordifferent ate compensated filters, one requires the information compressionfactorsarestableevenathighcompression regarding the viewpoint of the test video. Generally, factorsof500.OurmethodclearlyoutperformsRecon+IDT theviewpointofthetestvideoisnotknown.Thisdif- methodandiscomparabletoOracleMACH[40],[20]. ficulty can be overcome by generating compensated filters corresponding to various viewpoints. In our experiments, we restrict our filters to two viewpoints The average time taken by STSF and Recon+IDT describedinsection3,i.eweuse‘Type-1’and‘Type-2’ to process a video of size 144 180 50 are shown filters. × × in parentheses in table 1. Recon+IDT takes about 20- 35 minutes to process one video, with the frame-wise 4.1 Reconstruction-free recognition on Weiz- reconstructionofthevideobeingthedominatingcom- manndataset ponent in the total computational time, while STSF Even though it is widely accepted in the computer framework takes only a few seconds for the same vision community that Weizmann dataset is an easy sized video since it operates directly on compressed dataset, with many methods achieving near perfect measurements. actionrecognitionrates,webelievethatworkingwith Spatial localization of action from compressive compressed measurements precludes the use of those cameras without reconstruction: Further, to validate well-established methods, and obtaining such high the robustness of action detection using the STSF action recognition rates at compression ratios of 100 framework,wequantifiedactionlocalizationinterms and above even for a simple dataset as Weizmann is of error in estimation of the subject’s centre from its 9 groundtruth.Thesubject’scentreineachframeises- number of videos in the UCF50 dataset. In total, we timated as the centre of the fixed sized bounding box generated 190 filters, as well as their flipped versions with location of the peak response (only the response (‘Type-2’ filters). The UCF50 database consists of 50 correspondingtotheclassifiedaction)inthatframeas actions, with around 120 clips per action, totaling itleft-topcorner.Figure5showsactionlocalizationin upto 6681 videos. The database is divided into 25 a few frames for various actions of the dataset (More groups with each group containing between 4-7 clips action localization results for Weizmann dataset can per action. We use leave-one-group cross-validation be found in supplementary material). Figure 6 shows to evaluate our framework. The recognition rates at that using these raw estimates, on average, the error differentcompressionratios,andthemeantimetaken fromthegroundtruthislessthanorequalto15pixels for one clip (in parentheses) for our framework and in approximately 70% of the frames, for compression Recon+IDTaretabulatedintable5.Table5alsoshows ratiosof100,200and300.Itisworthnotingthatusing therecognitionratesforvariousstate-of-the-artaction our framework it is possible to obtain robust action recognition methods, while operating on the full- localizationresultswithoutreconstructingtheimages, blown images, as indicated in the table by (FBI). Two even at extremely high compression ratios. conclusions follow from the table. 1) Our approach Experiments on the KTH dataset: We also con- outperforms the baseline method, Recon+IDT at very ducted experiments on the KTH dataset [36]. Since high compression ratios of 100 and above, and 2) the the KTH dataset is considered somewhat similar to mean time per clip is less than that for Recon+IDT the Weizmann dataset in terms of the difficulty and method. This clearly suggests that when operating at scale,wehaverelegatedtheactionrecognitionresults high compression ratios, it is better to perform action to the supplement. recognition without reconstruction than reconstruct- ing the frames and then applying a state-of-the-art 4.2 Reconstruction-free recognition on UCF method. The recognition rates for individual classes sportsdataset for Oracle MACH (OM), and compression ratios, 100 and 400 are given in table 6. The action localization The UCF sports action dataset [20] contains a total results for various actions are shown in figure 8. of 150 videos across 9 different actions. The dataset The bounding boxes in most instances correspond to is a challenging dataset with scale and viewpoint the human or the moving part of the human or the variations. For testing, we use leave-one-out cross object of interest. Note how the sizes of the bounding validation. At compression ratio of 100 and 300, the boxes are commensurate with the area of the action recognitionratesare70.67%and68%respectively.The in each frame. For example, for the fencing action, rates obtained are comparable to those obtained in the bounding box covers both the participants, and Oracle MACH set-up [20] (69.2%). Considering the fortheplayingpianoaction,theboundingboxcovers difficultyofthedataset,theseresultsareveryencour- just the hand of the participant. In the case of breast- aging. The confusion matrices for compression ratios stroke action, where human is barely visible, action 100 and 300 are shown in tables 3 and 4 respectively. localizationresultsareimpressive.Weemphasizethat Spatial localization of action from compres- action localization is achieved directly from compres- sivecameraswithoutreconstruction: Figure7shows sive measurements without any intermediate recon- action localization for some correctly classified in- struction, even though the measurements do not bear stances across various actions in the dataset, for Ora- anyexplicitinformationregardingpixellocations.We cle MACH and compression ratio = 100 (More action notethattheprocedureoutlinedaboveisbynomeans localization results can be found in supplementary a full-fledged procedure for action localization and material). It can be seen that action localization is es- is fundamentally different from the those in [41], timated reasonably well despite large scale variations [42], where sophisticated models are trained jointly and extremely high compression ratio. on action labels and the location of person in each frame, and action and its localization are determined 4.3 Reconstruction-free recognition on UCF50 simultaneouslybysolvingonecomputationallyinten- dataset siveinferenceproblem.Whileourmethodissimplistic To test the scalability of our approach, we conduct in nature and does not always estimate localization action recognition on large datasets, UCF50 [37] and accurately, it relies only on minimal post-processing HMDB51 [38]. Unlike the datasets considered earlier, of the correlation response, which makes it an at- these two datasets have large intra-class scale vari- tractive solution for action localization in resource- ability. To account for this scale variability, we gener- constrained environments where a rough estimate of ate about 2-6 filters per action. To generate MACH action location may serve the purpose. However, we filters, one requires bounding box annotations for donotethatactionlocalizationisnottheprimarygoal the videos in the datasets. Unfortunately frame-wise ofthepaperandthatthepurposeofthisexerciseisto bounding box annotations are not available for these showthatreasonablelocalizationresultsdirectlyfrom two datasets. Hence, we manually annotated a large compressivemeasurementsarepossible,evenusinga 10 (a) (b) (c) Fig. 5. Spatial localization of subject without reconstruction at compression ratio = 100 for different actions in Weizmann dataset.a)Walkingb)Twohandedwavec)Jumpinplace Jumping Jack Walk One handed wave Two handed wave 1 1 1 1 Percentage of number of frames0000....2468 CCCCRRRR ==== 123500000000 Percentage of number of frames0000....2468 CCCCRRRR ==== 123500000000 Percentage of number of frames0000....2468 CCCCRRRR ==== 123500000000 Percentage of number of frames0000....2468 CCCCRRRR ==== 123500000000 00 5 10 15 20 25 30 00 5 10 15 20 25 30 00 5 10 15 20 25 30 00 5 10 15 20 25 30 Displacement from ground truth Displacement from ground truth Displacement from ground truth Displacement from ground truth (a) (b) (c) (d) Fig.6. LocalizationerrorforWeizmanndataset.X-axis:Displacementfromgroundtruth.Y-axis:Fractionoftotalnumberof framesforwhichthedisplacementofsubject’scentrefromgroundtruthislessthanorequaltothevalueinx-axis.Onaverage, forapproximately70%oftheframes,thedisplacementofgroundtruthislessthanorequalto15pixels,forcompressionratios of100,200and300. Action Golf-Swing Kicking RidingHorse Run-Side Skate-Boarding Swing Walk Diving Lifting Golf-Swing 77.78 16.67 0 0 0 0 5.56 0 0 Kicking 0 75 0 5 5 10 5 0 0 RidingHorse 16.67 16.67 41.67 8.33 8.33 0 8.33 0 0 Run-Side 0 0 0 61.54 7.69 15.38 7.69 7.69 0 Skate-Boarding 0 8.33 8.33 25 50 0 5 0 0 Swing 0 3.03 12.12 0.08 3.03 78.79 3.03 0 0 Walk 0 9.09 4.55 4.55 9.09 9.09 63.63 0 0 Diving 0 0 0 0 7.14 0 0 92.86 0 Lifting 0 0 0 0 0 0 0 16.67 83.33 TABLE3 ConfusionmatrixforUCFsportsdatabaseatacompressionfactor=100.Recognitionrateforthisscenariois70.67%, whichiscomparabletoOracleMACH[20](69.2%). rudimentaryprocedureasoutlinedabove.Thisclearly 4.4 Reconstruction-free recognition on HMDB51 suggests that with more sophisticated models, better dataset reconstruction-free action localization results can be The HMDB51 database consists of 51 actions, with achieved. One possible option is to co-train models around120clipsperaction,totallingupto6766videos. jointlyonactionlabelsandannotatedboundingboxes The database is divided into three train-test splits. in each frame similar to [41], [42], while extracting The average recognition rate across these splits is spatiotemporal features such as HOG3D [13] features reported here. For HMDB51 dataset, we use the same for correlation response volumes, instead of the input filters which were generated for UCF50 dataset. The video. recognition rates at different compression ratios, and