Int J Multimed Info Retr (2013) 2:73–101
DOI 10.1007/s13735-012-0024-2

TRENDS AND SURVEYS

High-level event recognition in unconstrained videos

Yu-Gang Jiang · Subhabrata Bhattacharya · Shih-Fu Chang · Mubarak Shah

Received: 10 September 2012 / Accepted: 15 October 2012 / Published online: 13 November 2012
© Springer-Verlag London 2012

Abstract  The goal of high-level event recognition is to automatically detect complex high-level events in a given video sequence. This is a difficult task, especially when videos are captured under unconstrained conditions by non-professionals. Such videos depicting complex events have limited quality control and, therefore, may include severe camera motion, poor lighting, heavy background clutter, and occlusion. However, due to the fast-growing popularity of such videos, especially on the Web, solutions to this problem are in high demand and have attracted great interest from researchers. In this paper, we review current technologies for complex event recognition in unconstrained videos. While the existing solutions vary, we identify common key modules and provide detailed descriptions along with some insights for each of them, including extraction and representation of low-level features across different modalities, classification strategies, fusion techniques, etc. Publicly available benchmark datasets, performance metrics, and related research forums are also described. Finally, we discuss promising directions for future research.

Keywords  Video events · Recognition · Unconstrained videos · Multimedia event detection · Multimodal features · Fusion

Y.-G. Jiang
School of Computer Science, Fudan University, Shanghai, China
e-mail: [email protected]

S. Bhattacharya · M. Shah
Computer Vision Lab, University of Central Florida, Orlando, FL, USA
e-mail: [email protected]

S.-F. Chang
Department of Electrical Engineering, Columbia University, New York, NY, USA
e-mail: [email protected]

1 Introduction

High-level video event recognition is the process of automatically identifying video clips that contain events of interest. The high-level or complex events, by our definition, are long-term, spatially and temporally dynamic object interactions that happen under certain scene settings. Two popular categories of complex events are instructional and social events. The former includes procedural videos (e.g., "making a cake", "changing a vehicle tire"), while the latter includes social activities (e.g., "birthday party", "parade", "flash mob"). Techniques for recognizing such high-level events are essential for many practical applications such as Web video search, consumer video management, and smart advertising.

The focus of this work is to address the issues related to high-level event recognition. Events, actions, interactions, activities, and behaviors have been used interchangeably in the literature [1,15], and there is no agreement on the precise definition of each term. In this paper, we attempt to provide a hierarchical model for complex event recognition in Fig. 1. Movement is the lowest-level description: "an entity (e.g., hand) is moved with large displacement in the right direction with slow speed". Movements can also be referred to as attributes, which have recently been used in human action recognition [73] following their successful use in face recognition in a single image. Next are activities or actions, which are sequences of movements (e.g., "hand moving to right followed by hand moving to left", which is a "waving" action). An action has a more meaningful interpretation and is often performed by entities (e.g., human, animal, and vehicle).

Fig. 1  A taxonomy of semantic categories in videos, with increased complexity from bottom to top. Attributes are basic components (e.g., movements) of actions, while actions are key elements of interactions. High-level events (the focus of this paper) lie on top of the hierarchy and contain (normally multiple) complex actions and interactions evolving over time
An action can also be performed between two or more entities, which is commonly referred to as an interaction (e.g., person lifts an object, person kisses another person, car enters facility, etc.). Motion verbs can also be used to describe interactions. Recently, the Mind's Eye dataset was released under a DARPA program; it contains many motion verbs such as "approach", "lift", etc. [11]. In this hierarchy, concepts span across both actions and interactions. In general, concept is a loaded word, which has been used to represent objects, scenes, and events, such as those defined in the large-scale concept ontology for multimedia (LSCOM) [95]. Finally, at the top level of the hierarchy, we have complex or high-level events that have larger temporal durations and consist of a sequence of interactions or stand-alone actions; e.g., the event "changing a vehicle tire" contains a sequence of interactions such as "person opening trunk" and "person using wrench", followed by actions such as "squatting", and so on. Similarly, another complex event such as "birthday party" may involve actions like "person clapping" and "person singing", followed by interactions like "person blowing candle" and "person cutting cake". Note that although we have attempted to encapsulate most semantic components of complex events in a single hierarchy, because of the polysemous nature of the words, adopting the same terminologies across the research community is an impossible objective to achieve.

Having said that, we set the context of event recognition as the detection of the temporal and spatial locations of a complex event in a video sequence. In a simplified case, when temporal segmentation of the video into clips has already been achieved, or when each video contains only one event and precise spatial localization is not important, the task reduces to a video classification problem.

While many existing works have employed only the visual modality for event recognition, it is important to emphasize that video analysis is intrinsically multimodal, demanding multidisciplinary knowledge and tools from many fields, such as computer vision, audio and speech analysis, multimedia, and machine learning. To deal with the large-scale data that is common nowadays, scalable indexing methods and parallel computational platforms are also becoming an important part of modern video analysis systems.

There exist many challenges in developing automatic video event recognition systems. One well-known challenge is the long-standing semantic gap between computable low-level features (e.g., visual, audio, and textual features) and the semantic information that they encode (e.g., the presence of meaningful classes such as "a person clapping", "sound of a crowd cheering", etc.) [129]. Current approaches rely heavily on classifier-based methods employing directly computable features. In other words, these classifiers attempt to establish a correspondence between the computed features, or a quantized layer on top of the features, and the actual label of the event depicted in the video. In doing so, they lack a semantically meaningful, yet conceptually abstract, intermediate representation of the complex event, which could be used to explain what a particular event is and how such a representation can be used to recognize other complex events. This is why, even with much progress made in the past decade, the computational approaches involved in complex event recognition are reliable only under certain domain-specific constraints.

Moreover, with the popularity of handheld video recording devices, the issue becomes more serious, since a huge amount of video is currently being captured by non-professional users under unconstrained conditions with limited quality control (in contrast, for example, to videos from broadcast news, documentary, or controlled surveillance). This amplifies the semantic gap challenge. However, it also opens a great opportunity, since the proliferation of such user-generated videos has greatly contributed to the rapidly growing demand for new capabilities in recognizing high-level events in videos.

In this paper, we will first discuss the current popular methods for high-level video event recognition from multimedia data. We will review multimodal features, models, and evaluation strategies that have been widely studied by many groups in the recent literature. Compared to the few existing survey papers in this area (as summarized in the following subsection), this paper has a special focus on high-level events. We will provide in-depth descriptions of techniques that have been shown to be promising in recent benchmark evaluation activities such as the multimedia event detection (MED) task [99] of NIST TRECVID.¹ Additionally, we will discuss several important related issues such as the design of evaluation benchmarks for high-level event recognition. Finally, we will identify promising directions for future research and development. To stimulate further research, at the end of each important section we provide comments summarizing the issues with the discussed approaches and insights that may be useful for the development of future high-level event recognition systems.

¹ TREC video retrieval evaluation (TRECVID) [128] is an open forum for promoting and evaluating new research in video retrieval. It features a benchmark activity sponsored annually, since 2001, by the US National Institute of Standards and Technology (NIST). See http://trecvid.nist.gov for more details.

1.1 Related reviews

There have been several related papers that review the research on video content recognition. Most of them focused on human action/activity analysis, e.g., Aggarwal and Ryoo [1], Poppe [111], and Turaga et al. [139], where low-level features, representations, classification models, and datasets were comprehensively surveyed. While most human activity research was done on constrained videos with limited content (e.g., clean background and no camera motion), recent works have also shifted focus to the analysis of realistic videos such as user-uploaded videos on the Internet, or broadcast and documentary videos.

In [130], Snoek and Worring surveyed approaches to multimodal video indexing, focusing on methods for detecting various semantic concepts consisting mainly of objects and scenes. They also discussed video retrieval techniques exploring concept-based indexing, where the main application data domains were broadcast news and documentary videos. Brezeale and Cook [17] surveyed text, video, and audio features for classifying videos into a predefined set of genres, e.g., "sports" or "comedy". Morsillo et al. [94] presented a brief review that focused on efficient and scalable methods for annotating Web videos at various levels, including objects, scenes, actions, and high-level events. Lavee et al. [67] reviewed event modeling methods, mostly in the context of simple human activity analysis. A review more closely related to this paper is the one by Ballan et al. [8], which discussed features and models for detecting both simple actions and complex events in videos.

Different from the existing surveys mentioned above, this paper concentrates on the recognition of high-level complex events from multimedia data, such as those mentioned in Fig. 1. Many techniques for recognizing objects, scenes, and human activities will be discussed to the extent needed for understanding high-level event recognition. However, providing full coverage of those topics is beyond the scope of this work.

1.2 Outline

We first organize the research on high-level event recognition into several key dimensions (shown in Fig. 2), based on which the paper is structured. While the design of a real system may depend on application requirements, key components like those identified (feature extraction and the recognition model) are essential. These core technologies will be discussed in Sects. 2 and 3. In Sect. 4, we discuss advanced issues beyond simple video-level classification, such as temporal localization of events, textual recounting of detection results, and techniques for improving recognition speed and dealing with large-scale data. To help stimulate new research activities, we present reviews of popular benchmarks and explore insights from several top-performing systems in recent evaluation forums in Sect. 5. Finally, we discuss promising directions for future research in Sect. 6 and present conclusions in Sect. 7.

Fig. 2  Overview of various aspects of video event recognition research. The presentation of the paper is structured based on this organization. Numbers correspond to the sections covering the topics

2 Feature representations

Features play a critical role in video event analysis. Good features are expected to be robust against variations so that videos of the same event class under different conditions can still be correctly recognized. There are two main sources of information that can be exploited.
The visual channel, on one hand, depicts appearance information related to objects and scene settings, while on the other hand, it captures motion information pertaining to the movement of the constituent objects and the motion of the camera. The second is the acoustic channel, which may contain music, environmental sounds, and/or conversations. Both channels convey useful information, and many visual and acoustic features have been devised. We discuss static frame-based visual features in Sect. 2.1, spatio-temporal visual features in Sect. 2.2, acoustic features in Sect. 2.3, audio-visual joint representations in Sect. 2.4, and finally, the bag-of-features framework, which converts audio/visual features into fixed-dimensional vectors, in Sect. 2.5.

2.1 Frame-based appearance features

Appearance-based features are computed from a single frame. They do not consider the temporal dimension of video sequences, but are widely used in video analysis since they are relatively easy to compute and have been shown to work well in practice. There is a very rich knowledge base and extensive publicly available resources (public tools) devoted to static visual features. We divide existing works into local and global features, as discussed in the following.

2.1.1 Local features

A video frame can be represented efficiently using a set of discriminative local features extracted from it. The extraction of local features consists of two steps: detection and description. Detection refers to the process of locating stable patches that have some desirable properties which can be employed to create a "signature" of an image. In practice, uniform and dense sampling of image patches, with some obvious storage overhead, is often used in comparison to the rather computationally expensive, but less storage-intensive, patch detection [101].

Among popular local patch (a.k.a. interest point) detection algorithms, the most widely used one is Lowe's Difference-of-Gaussian (DoG) [77], which detects blob regions where the center differs from the surrounding area. Other popular detectors include Harris–Laplace [72], Hessian [88], maximally stable extremal regions (MSERs) [85], etc. Harris and Hessian focus on the detection of corner points. MSER also targets blob detection, but relies on a different scheme: unlike DoG, which detects local maxima in multi-scale Gaussian-filtered images, MSER finds regions whose segmentation is stable over a large range of thresholds. Figure 3 shows example results of two detectors. Interested readers are referred to [90] for a comprehensive review of several local patch detectors. Although it is observed that dense sampling eliminates the requirement of detectors, recent experiments in [34] confirm that methods using both strategies (sparse detection and dense sampling) offer the best performance in visual recognition tasks. In this direction, Tuytelaars proposed a hybrid selection strategy [140], demonstrating how the advantages of both sampling schemes can be efficiently combined to improve recognition performance.

Fig. 3  Example results of local detectors: Harris–Laplace (left) and DoG (right). The images are obtained from [30]

Once local patches are identified, the next stage is to describe them in a meaningful manner so that the resulting descriptors are (partially) invariant to rotation, scale, viewpoint, and illumination changes. Since the descriptors are computed from small patches rather than a whole frame, they are also somewhat robust to partial occlusion and background clutter.

Many descriptors have been designed over the years. The best known is the scale-invariant feature transform (SIFT) [77], which partitions a patch into equal-sized grids, each described by a histogram of gradient orientations. A key idea of SIFT is that a patch is represented relative to its dominant orientation, which provides a nice property of rotation invariance. SIFT, coupled with the local detectors introduced above, has been among the most popular choices in recent video event recognition systems [10,58,96,98].

SIFT has been extended in various ways. PCA-SIFT was proposed by Ke et al. [60], who applied principal component analysis (PCA) to reduce the dimensionality of SIFT. They stated that PCA-SIFT is not only compact but also more robust, since PCA may help reduce noise in the original SIFT descriptors. However, such performance gains of PCA-SIFT were not found in the comparative study in [89]. An improved version of SIFT, called the gradient location and orientation histogram (GLOH), was proposed in [89], using a log-polar location grid instead of the original rectangular grid of SIFT. Work in [119] studied color descriptors that incorporate color information into the intensity-based SIFT for improved object and scene recognition, reporting a performance gain of 8% on the PASCAL VOC 2007 dataset.² Further, to improve computational efficiency, Bay et al. [12] developed SURF as a fast alternative descriptor using 2D Haar wavelet responses.

Several other descriptors have also been popular in this context. The histogram of oriented gradients (HOG) was proposed by Dalal and Triggs [27] to capture edge distributions in images or video frames. The local binary pattern (LBP) [103] is another texture feature, which uses binary numbers to label each pixel of a frame by comparing its value to those of its neighborhood pixels.

² The PASCAL visual object class (VOC) challenge is an annual benchmark competition on image-based object recognition, supported by the EU-funded PASCAL2 Network of Excellence on Pattern Analysis, Statistical Modeling and Computational Learning.
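To make the detection-plus-description pipeline above concrete, the following minimal sketch detects DoG keypoints and computes SIFT descriptors on uniformly sampled frames with OpenCV. The video path, the one-frame-per-second sampling rate, and the helper name are illustrative assumptions, not choices prescribed by the works surveyed here.

```python
# Sketch: DoG keypoint detection + SIFT description on sampled video frames.
# Assumes OpenCV (opencv-python >= 4.4, where SIFT is available); the sampling
# rate and function name are illustrative only.
import cv2

def sift_descriptors(video_path, fps_sample=1.0):
    """Return a list of (frame_index, descriptors) pairs, one per sampled frame."""
    cap = cv2.VideoCapture(video_path)
    sift = cv2.SIFT_create()                      # DoG detector + SIFT descriptor
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps / fps_sample)), 1)   # uniform temporal sampling
    features, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            _, desc = sift.detectAndCompute(gray, None)
            if desc is not None:                  # desc is an N x 128 array
                features.append((idx, desc))
        idx += 1
    cap.release()
    return features
```

Dense sampling, as discussed above, would simply replace the detector with a regular grid of patch centers while keeping the same descriptor stage.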
2.1.2 Global features

In earlier works, global representations were employed, which encode a whole image based on the overall distribution of color, texture, or edge information. Popular ones include the color histogram, color moments [166], and Gabor texture [83]. Oliva and Torralba [104] proposed a very low-dimensional scene representation which implicitly encodes perceptual naturalness, openness, roughness, expansion, and ruggedness using spectral and coarsely localized information. Since this represents the dominant spatial structure of a scene, it is referred to as the GIST descriptor. Most of these global features adopt grid-based representations which take the spatial distribution of the scene into account (e.g., "sky" always appears above "road"). Features are computed within each grid cell separately and then concatenated as the final representation. This simple strategy has been shown to be effective for various image/video classification tasks.

Summary  Single-frame based feature representations, such as SIFT, GIST, HOG, etc., are the most straightforward to compute and have low complexity. These features have been shown to be extremely discriminative for videos that do not depict rapid inter-frame changes. For videos with rapid content changes, one needs to carefully sample the frames from which these features are extracted if not all frames are used. Since an optimal keyframe selection strategy is yet to be developed, researchers in practice sample frames uniformly. A low sampling rate could lead to loss of vital information, while high sampling rates result in redundancies. Furthermore, these features do not include temporal information, and hence they are ineffective in representing motion, a very important source of information in videos. This motivates us to move on to the next section, which discusses spatio-temporal (motion) features.
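Before moving on, the grid-based global representation described in Sect. 2.1.2 can be illustrated with a short sketch that computes a color histogram per grid cell and concatenates the cells into one frame-level vector. The 4x4 grid and 8 bins per channel are arbitrary illustrative parameters, not values taken from the cited works.

```python
# Sketch: grid-based global color feature for a single frame.
# The 4x4 grid and 8 bins per BGR channel are illustrative choices.
import numpy as np
import cv2

def grid_color_histogram(frame_bgr, grid=(4, 4), bins=8):
    """Concatenate per-cell color histograms into one global descriptor."""
    h, w = frame_bgr.shape[:2]
    gh, gw = grid
    cells = []
    for r in range(gh):
        for c in range(gw):
            cell = frame_bgr[r * h // gh:(r + 1) * h // gh,
                             c * w // gw:(c + 1) * w // gw]
            hist = cv2.calcHist([cell], [0, 1, 2], None,
                                [bins, bins, bins],
                                [0, 256, 0, 256, 0, 256]).flatten()
            cells.append(hist / (hist.sum() + 1e-8))   # normalize each cell
    return np.concatenate(cells)                        # length = gh * gw * bins**3
```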
2.2 Spatio-temporal visual features

Different from frame-based features, spatio-temporal features take the time dimension of videos into account, which is intuitively appealing since temporal motion information is critical for understanding high-level events.

2.2.1 Spatio-temporal local features

Many spatio-temporal video features have been proposed. Apart from several efforts in designing global spatio-temporal representations, a more popular direction is to extend the frame-based local features to work in 3D (x, y, t), namely spatio-temporal descriptors. In [65], Laptev extended the Harris corner detector [72] to locate spatio-temporal interest points (STIPs), which are space-time volumes in which pixel values have significant variations in both space and time. Figure 4 gives two example results of STIP detection. As will be discussed in later sections, STIP has been frequently used in recent video event recognition systems.

Fig. 4  Results of STIP detection using a synthetic sequence (left) and a realistic video (right; the detected points are shown on an image frame). The images are reprinted from [65] (© 2005 Springer-Verlag)

Several alternatives to STIP have been proposed. Dollar et al. [29] proposed to use Gabor filters for 3D keypoint detection. The detector, called Cuboid, finds local maxima of a response function that combines a 2D Gaussian smoothing kernel with 1D temporal Gabor filters. Rapantzikos et al. [112] used saliency to locate spatio-temporal points, where the saliency is computed by a global minimization process which leverages spatial proximity, scale, and feature similarity. To compute the feature similarity, they also utilized color, in addition to the intensity and motion commonly adopted in other detectors. Moreover, Willems et al. [158] used the determinant of a Hessian matrix as the saliency measure, which can be efficiently computed using box-filter operations on integral videos. Wang et al. [151] conducted a comparative study on spatio-temporal local features and found that dense sampling works better than the sparse detectors STIP, Cuboid, and Hessian, particularly on videos captured under realistic settings (in contrast to those taken in constrained environments with clean backgrounds). Dense sampling, however, requires a much larger number of features to achieve good performance.

As with the 2D local features, we also need descriptors to encode the 3D spatio-temporal points (volumes). Most existing 3D descriptors are motivated by those designed for the 2D features. Dollar et al. [29] tested simple flattening of intensity values in a cuboid around an interest point, as well as global and local histograms of gradients and optical flow. SIFT [77] was extended to 3D by Scovanner et al. [122], and SURF [12] was adapted to 3D by Knopp et al. [62]. Laptev et al. [66] used grid-based HOG and histograms of optical flow (HOF), computed by dividing a 3D volume into multiple grids, to describe STIPs [65], and found the concatenation of HOG and HOF descriptors very effective. Biologically, the combination of the two descriptors also makes good sense, since HOG encodes appearance information while HOF captures motion cues. Klaser et al. [61] also extended HOG to 3D and proposed to utilize integral videos for fast descriptor computation. The self-similarities descriptor was adapted by Junejo et al. [123] for cross-view action recognition. Recently, Taylor et al. [135] proposed to use convolutional neural networks to implicitly learn spatio-temporal descriptors, and obtained human action recognition performance comparable to the STIP detector paired with HOG-HOF descriptors. Le et al. [69] combined independent subspace analysis (ISA) with ideas from convolutional neural networks to learn invariant spatio-temporal features. Better results from the ISA features over the standard STIP detector and HOG-HOF descriptors were achieved on several action recognition benchmarks [69].
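To give a flavor of the HOF component of the HOG-HOF descriptors discussed above, the sketch below accumulates a histogram of optical flow orientations over a short stack of frames using OpenCV's Farneback dense flow. The 8-bin quantization and magnitude weighting are common conventions assumed for illustration; a full implementation would follow the space-time grid layout of [66].

```python
# Sketch: HOF-like descriptor for a small space-time volume.
# Farneback dense optical flow from OpenCV; 8 orientation bins and
# magnitude-weighted voting are illustrative, not a specific published recipe.
import numpy as np
import cv2

def hof_descriptor(frames_gray, bins=8):
    """frames_gray: list of consecutive grayscale frames (2D uint8 arrays)."""
    hist = np.zeros(bins, dtype=np.float64)
    for prev, curr in zip(frames_gray[:-1], frames_gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # ang in radians
        bin_idx = (ang / (2 * np.pi) * bins).astype(int) % bins
        for b in range(bins):
            hist[b] += mag[bin_idx == b].sum()    # magnitude-weighted votes
    return hist / (hist.sum() + 1e-8)             # L1-normalized histogram
```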
2.2.2 Trajectory descriptors

Spatio-temporal information can also be captured by tracking the frame-based local features. These trajectory-based descriptors are theoretically superior to descriptors such as HOG-HOF, 3D SURF, Dollar's Cuboids, etc., because they require the detection of a discriminative point or region over a sustained period of time, unlike the latter, which compute various pixel-based statistics within a predefined spatio-temporal neighborhood (typically 50 × 50 × 20 pixels). However, the computation of trajectory descriptors incurs substantial computational overhead.

The first of its kind was proposed by Wang et al. [149], where the authors used the well-known Kanade–Lucas–Tomasi (KLT) tracker [79] to extract DoG-SIFT key-point trajectories and computed a feature by modeling the motion between every trajectory pair. Sun et al. [132] also applied KLT to track DoG-SIFT key-points. Different from [149], they computed three levels of trajectory context: point-level context, which is an averaged SIFT descriptor; intra-trajectory context, which models trajectory transitions over time; and inter-trajectory context, which encodes proximities between trajectories.

The velocity histories of key-point trajectories were modeled by Messing et al. [87], who observed that velocity information is useful for detecting daily living actions in high-resolution videos. Uemura et al. [141] combined feature tracking and frame segmentation to estimate dominant planes in the scene, which were used for motion compensation. Yuan et al. [171] clustered key-point trajectories based on spatial proximities and motion patterns. Like [141], this method extracts relative features from clusters of trajectories on the background that describe the motion differently from those emanating from the foreground. As a result, the effect of camera motion can be alleviated. In addition, Raptis and Soatto [113] proposed tracklets, which differ from long-term trajectories by capturing the local causal structure of action elements. A more recent work by Wu et al. [159] used Lagrangian particle trajectories and decomposed the trajectories into camera-induced and object-induced components, which makes their method robust to camera motion.
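The sparse key-point tracking step that the above methods build on, such as the KLT tracker used in [149] and [132], can be sketched as follows with OpenCV's corner detector and pyramidal Lucas–Kanade tracker. Shi–Tomasi corners stand in for DoG-SIFT key-points, and all parameter values are illustrative simplifications.

```python
# Sketch: key-point trajectories via KLT tracking across a frame sequence.
# Shi-Tomasi corners replace the DoG-SIFT key-points of [149,132] for brevity;
# parameter values are illustrative.
import numpy as np
import cv2

def klt_trajectories(frames_gray, max_corners=500):
    """Return a list of full-length trajectories, each a list of (x, y) positions."""
    pts = cv2.goodFeaturesToTrack(frames_gray[0], maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return []
    tracks = [[tuple(p.ravel())] for p in pts]
    alive = np.ones(len(tracks), dtype=bool)
    prev, prev_pts = frames_gray[0], pts
    for curr in frames_gray[1:]:
        next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, prev_pts, None)
        alive &= status.ravel().astype(bool)          # a lost point ends its trajectory
        for i, p in enumerate(next_pts):
            if alive[i]:
                tracks[i].append(tuple(p.ravel()))
        prev, prev_pts = curr, next_pts
    return [t for t, a in zip(tracks, alive) if a]
```

Trajectory-level descriptors (shape, velocity, or the motion statistics discussed next) would then be computed along each surviving point sequence; dense trajectories, described below, replace the sparse corners with densely sampled patches.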
Wang et al. [150] performed tracking on dense patches. They showed that dense trajectories significantly outperform KLT tracking of sparse key-points on several human action recognition benchmarks. In addition, a trajectory descriptor called the motion boundary histogram (MBH) was also introduced in [150], which is based on the derivatives of optical flow. The derivatives are able to suppress constant motion, making MBH robust to camera movement. It has been shown to be very effective for action recognition in realistic videos [150]. Most recently, the work of [54] proposed to use local and global reference points to model the motion of dense trajectories, leading to a comprehensive representation that integrates trajectory appearance, location, and motion. The resulting representation is expected to be robust to camera motion, and also able to capture the relationships of moving objects (or object-background relationships). Very competitive results were observed on several human action recognition benchmarks.

Summary  Spatio-temporal visual features capture meaningful statistics from videos, especially those related to local changes or saliency in both the spatial and temporal dimensions. Most of the motion-based features are restricted to either optical flow or its derivatives. The role of semantically more meaningful motion features (e.g., kinematic features [2,4]) is yet to be tested in the context of this problem. Furthermore, most of these feature descriptors capture statistics based on either motion alone, or motion and appearance independently. Treating the motion and appearance modalities jointly can further reveal important information that is otherwise lost in the process. Trajectories computed from local features have been shown to achieve performance gains at the cost of additional computing overhead.

2.3 Acoustic features

Acoustic information is valuable for video analysis, particularly when the videos are captured in realistic and unconstrained environments. Mel-frequency cepstral coefficients (MFCC) are among the most popular acoustic features for sound classification [7,33,163]. MFCC represents the short-term power spectrum of an audio signal, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Xu et al. [163] used MFCC together with another popular feature, the zero crossing rate (ZCR), for audio classification. Predictions of audio categories such as "whistling" and "audience sound" are used for detecting high-level sports events like "foul", "goal", etc. Baillie and Jose [7] used a similar framework, but with MFCC features alone, for audio-based event recognition.
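As a concrete reference for the MFCC representation, the snippet below extracts a matrix of short-term MFCC frames from a soundtrack with the librosa library. The 13 coefficients and the 25 ms window with a 10 ms hop are common defaults assumed for illustration, not values mandated by the works cited above, and the audio is assumed to have already been demuxed to a file.

```python
# Sketch: MFCC extraction from a video soundtrack (audio already saved as WAV).
# Assumes librosa; 13 coefficients and 25 ms / 10 ms windowing are conventional,
# illustrative settings.
import librosa

def extract_mfcc(audio_path, n_mfcc=13):
    """Return an (n_mfcc, n_frames) matrix of short-term MFCC descriptors."""
    y, sr = librosa.load(audio_path, sr=16000)                 # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),         # 25 ms analysis window
                                hop_length=int(0.010 * sr))    # 10 ms hop
    return mfcc
```

Each column is one short-term descriptor; a soundtrack therefore yields a variable-size set of MFCC frames, which the bag-of-audio-words scheme of Sect. 2.5 turns into a fixed-length histogram.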
Eronen et al. [33] evaluated many audio features. Using a dataset of realistic audio contexts (e.g., "road", "supermarket", and "bathroom"), they found that the best performance was achieved by MFCC. In a different vein, Patterson et al. [107] proposed the auditory image model (AIM) to simulate the spectral analysis, neural encoding, and temporal integration performed by the human auditory system. In other words, AIM is a time-domain model of auditory processing intended to simulate the auditory images humans hear when presented with complex sounds like music, speech, etc. There are three main stages involved in the construction of an auditory image. First, an auditory filterbank is used to simulate the basilar membrane motion (BMM) produced by a sound in the cochlea (the auditory portion of the inner ear). Next, a bank of hair cell simulators converts the BMM into a simulation of the neural activity pattern (NAP) produced at the level of the auditory nerve. Finally, a form of strobed temporal integration (STI) is applied to each channel of the NAP to stabilize any repeating pattern and convert it into a simulation of our auditory image of the sound. Thus, sequences of auditory images can be used to illustrate the dynamic response of the auditory image to everyday sounds. A recent work in this direction shows that features computed on auditory images perform better than the more conventional MFCC features for audio analysis [80].

Speech is another acoustic cue that can be extracted from video soundtracks. An early work by Chang et al. [22] reported that speech understanding is even more useful than image analysis for sports video event recognition. They used filter banks as features and simple template matching to detect a few pre-defined keywords such as "touchdown". Minami et al. [91] utilized music and speech detection to assist video analysis, where the detection is based on sound spectrograms. Automatic speech recognition (ASR) has also been used for years in the annual TRECVID video retrieval evaluations [128]. A general conclusion is that ASR is useful for text-based video search over the speech transcripts, but not for semantic visual concept classification. In [96], ASR was found to be helpful for a few events (e.g., narrative explanations in procedural videos), but not for general videos.

Summary  Acoustic features have been found useful in high-level video event recognition. Although many new features have been proposed in the literature, the most popularly used one is currently still the MFCC. Developing new audio representations that are more suitable for video event recognition is an interesting direction.

2.4 Audio-visual joint representations

Audio and visual features are mostly treated independently for multimedia analysis. However, in practice, they are not independent, except in some special cases where a video's audio channel is dubbed with entirely different audio content, e.g., a motorbike stunt video dubbed with a music track. In the usual cases, statistical information such as co-occurrence, correlation, covariance, and causality can be exploited across both audio and visual channels to perform efficient multimodal analysis.

Beal et al. [13] proposed to use graphical models to combine audio and visual variables for object tracking. This method was designed for videos captured in a controlled environment and therefore may not be applicable to unconstrained videos. More recently, Jiang et al. [51] proposed a joint audio-visual feature, called the audio-visual atom (AVA). An AVA is an image region trajectory associated with both regional visual features and audio features. The audio features (MFCC of audio frames) and visual features (color and texture of short-term region tracks) are first quantized into discrete codewords separately. Jointly occurring audio-visual codeword pairs are then discovered using a multiple instance learning framework. Compared to simple late fusion of classifiers using separate modalities, better results were observed using a bag-of-AVA representation on an unconstrained video dataset [76]. This approach was further extended in [52], where a representation called the audio-visual grouplet (AVG) was proposed. AVGs are sets of audio and visual codewords. The codewords are grouped together as an AVG if strong temporal correlations exist among them. The temporal correlations were determined using Granger's temporal causality [43]. AVGs were shown to be better than simple late fusion of audio-visual features.

The methods introduced in [51,52] require either frame segmentation or foreground/background separation, which is computationally expensive. Ye et al. [168] proposed a simple and efficient method called bi-modal audio-visual codewords. The bi-modal words were generated using a normalized cut on a bipartite graph of visual and audio words, which captures the co-occurrence relations between audio and visual words within the same time window. Each bi-modal word is a group of visual and/or audio words that frequently co-occur. Promising performance was reported in high-level event recognition tasks.

Summary  Audio and visual features extracted using the methods discussed provide a promising representation to capture the multimodal characteristics of the video content. However, these features are still quite limited. For example, MFCC and region-level visual features may not be the right representation for discovering cross-modal correlations. In addition, the quality of the features may not be adequate due to noise, clutter, and motion. For example, camera motion can be a useful cue to discriminate between life events (usually depicting random camera jitter, zoom, pan, and tilt) and procedural events (usually a static camera with occasional pan and/or tilt). None of the current feature extraction methods addresses this issue, as the spatio-temporal visual feature extraction algorithms are not capable of distinguishing between the movement of objects in the scene and the motion of the camera. Another disadvantage of feature-based techniques is that they are often ungainly in terms of both dimensionality and cardinality, which leads to storage issues given the phenomenal number of videos. It is therefore desirable to seek an additional intermediate representation for further analysis.
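The co-occurrence statistics that drive bi-modal codeword discovery can be illustrated with a short sketch: given per-window lists of quantized visual and audio word indices, it accumulates the visual-audio co-occurrence matrix on which a grouping step such as the normalized cut of [168] would subsequently operate. The function and window representation are hypothetical, and the graph-partitioning step itself is omitted.

```python
# Sketch: visual/audio word co-occurrence counts within shared time windows.
# 'windows' is assumed to be a list of (visual_word_ids, audio_word_ids) pairs,
# one pair per time window; the bipartite-graph partitioning of [168] that
# would group words into bi-modal codewords is not shown.
import numpy as np

def av_cooccurrence(windows, n_visual_words, n_audio_words):
    """Return an (n_visual_words, n_audio_words) co-occurrence count matrix."""
    counts = np.zeros((n_visual_words, n_audio_words), dtype=np.float64)
    for visual_ids, audio_ids in windows:
        for v in set(visual_ids):              # count each word once per window
            for a in set(audio_ids):
                counts[v, a] += 1.0
    return counts
```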
2.5 Bag of features

The local features (e.g., SIFT [77] and STIP [65]) discussed above vary in set size, i.e., the number of features extracted differs across videos (depending on the complexity of the content, video duration, etc.). This poses difficulties for measuring video/frame similarities, since most measurements require fixed-dimensional inputs. One solution is to directly match local features between two videos and determine video similarity based on the similarities of the matched feature pairs. The pairwise matching process is nevertheless computationally expensive, even with the help of indexing structures. This issue can be addressed using a framework called bag-of-features or bag-of-words (BoW) [127]. Motivated by the well-known bag-of-words representation of textual documents, BoW treats images or video frames as "documents" and uses a similar word-occurrence histogram to represent them, where the "visual vocabulary" is generated by clustering a large set of local features and treating each cluster center as a "visual word".

BoW has been popular in image/video classification for years. The performance of BoW is sensitive to many implementation choices, which have been extensively studied in several works, mostly in the context of image classification with frame-based local features like SIFT. Zhang et al. [174] evaluated various local features and reported competitive object recognition performance by combining multiple local patch detectors and descriptors. Jiang et al. [55] conducted a series of analyses on several choices of BoW in video concept detection, including term weighting schemes (e.g., term frequency and inverse document frequency) and vocabulary size (i.e., the number of clusters). An important finding is that term weighting is very important, and a soft-weighting scheme was proposed to alleviate the effect of quantization error by softly assigning a descriptor to multiple visual words. The usefulness of such a soft-weighting scheme was also confirmed in detecting a large number of concepts in TRECVID [21]. A similar idea of soft assignment was also presented by Philbin et al. [109]. van Gemert et al. [41] proposed an interesting approach called kernel codebooks to tackle the same issue of quantization loss. For vocabulary size, a general observation is that a few hundred to several thousand visual words are sufficient for most visual classification tasks. In addition, Liu and Shah proposed to apply maximization of mutual information (MMI) for visual word generation [75]. Compared to typical methods like k-means, MMI is able to produce a higher level of word clusters, which are semantically more meaningful and also more discriminative for visual recognition. A feature selection method based on the page-rank idea was proposed in [74] to remove local patches that may hurt the performance of action recognition in unconstrained videos. Feature selection techniques were also adopted to choose discriminative visual words for video concept detection [56].
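The basic BoW pipeline and the soft-assignment idea discussed above can be sketched as follows: a k-means vocabulary is learned from pooled training descriptors, and each video's descriptors are then mapped to a softly weighted word histogram. The 1000-word vocabulary, the four soft neighbors, and the exponential distance weighting are illustrative choices, not the exact soft-weighting formulation of [55].

```python
# Sketch: bag-of-words vocabulary construction and soft-weighted histograms.
# Uses scikit-learn's KMeans; vocabulary size, neighbor count, and the
# distance-based weights are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(training_descriptors, n_words=1000):
    """training_descriptors: (N, D) array pooled from many training videos."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(training_descriptors)

def soft_bow_histogram(descriptors, vocab, n_neighbors=4, sigma=100.0):
    """Map one video's (M, D) descriptors to a soft-assigned word histogram."""
    centers = vocab.cluster_centers_
    hist = np.zeros(len(centers))
    for d in descriptors:
        dists = np.linalg.norm(centers - d, axis=1)
        nearest = np.argsort(dists)[:n_neighbors]
        weights = np.exp(-dists[nearest] / sigma)     # closer words receive larger votes
        hist[nearest] += weights / weights.sum()
    return hist / (hist.sum() + 1e-8)                 # L1 normalization
```

A hard-assignment BoW is recovered by setting n_neighbors to 1, and term weighting such as tf-idf can be applied to the resulting histograms before classification.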
Spatial locations of the patches are ignored in the standard BoW representation, which is not ideal, since the patch locations convey useful information. Lazebnik et al. [68] adopted a scheme similar to some of the global representations, partitioning a frame into rectangular grids at various levels and computing a BoW histogram for each grid cell. The histograms from the grids at each level are then concatenated as a feature vector, and a pyramid match kernel is applied to measure the similarity between frames, each with multiple features from different spatial partitioning levels. This simple method has proved effective in many applications and is now widely adopted. Researchers also found that direct concatenation of BoW histograms from the grids of all levels, plus support vector machine (SVM) learning with standard kernels, offers performance similar to the pyramid match kernel. However, it is worth noting that, different from BoW of 2D frames, the spatial pyramid architecture has rarely been used in spatio-temporal feature-based BoW representations, again indicating the difficulty of handling the temporal dimension of videos.

The BoW framework can also be extended to represent a sound as a bag of audio words (a.k.a. bag-of-frames in the audio community), where the acoustic features are computed locally from short-term auditory frames (a time window of tens of milliseconds), resulting in a set of auditory descriptors for a sound. This representation has been studied in several works on audio classification, where the implementation choices differ slightly from those of the visual feature representations. Mandel et al. [82] used Gaussian mixture models (GMM) to describe a song as a bag of frames for classification. Aucouturier et al. [5] and Lee et al. [70] also adopted a similar representation. Aucouturier et al. conducted an interesting set of experiments and found that bag-of-frames performs well for urban soundscapes but not for polyphonic music. In place of GMM, Lu et al. [78] adopted spectral clustering to generate auditory keywords. Promising audio retrieval performance was attained using their proposed representation on sports, comedy, award ceremony, and movie videos. Cotton et al. [26] proposed to extract sparse transient features corresponding to soundtrack events, instead of the uniformly and densely sampled audio frames. They reported that, with fewer descriptors, transient features produce performance comparable to the dense MFCCs for audio-based video event recognition, and that the fusion of both can lead to further improvements. The bag-of-audio-words representation has also been adopted in several video event recognition systems with promising performance (e.g., [10,58], among others).

Figure 5 shows a general framework of the BoW representation using different audio-visual features. A separate vocabulary is constructed for each feature type by clustering the corresponding feature descriptors. Finally, a BoW histogram is generated for each feature type. The histograms can then be normalized to create multiple representations of the input video. In the simplest case, all the histogram representations can be concatenated to create a final representation before classification. This approach is usually termed early fusion. An alternative approach is late fusion, where the histogram representations are independently fed into classifiers and the decisions from the classifiers are combined. These will be discussed in detail later in Sect. 3.4.
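A minimal sketch of the two fusion strategies just described, assuming one BoW histogram per feature type and per video: early fusion concatenates the histograms before a single classifier, while late fusion averages the scores of per-feature classifiers. The use of linear SVMs and plain score averaging is an illustrative simplification of the options covered in Sect. 3.4.

```python
# Sketch: early vs. late fusion of per-feature BoW histograms.
# 'features' maps a feature type (e.g., "sift", "stip", "mfcc") to an
# (n_videos, n_words) histogram matrix; linear SVMs and unweighted score
# averaging are illustrative choices only.
import numpy as np
from sklearn.svm import LinearSVC

def early_fusion_train(features, labels):
    X = np.hstack([features[name] for name in sorted(features)])   # concatenate histograms
    return LinearSVC().fit(X, labels)

def late_fusion_scores(train_features, labels, test_features):
    scores = []
    for name in sorted(train_features):
        clf = LinearSVC().fit(train_features[name], labels)        # one classifier per feature
        scores.append(clf.decision_function(test_features[name]))
    return np.mean(scores, axis=0)                                 # average decision scores
```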
Summary  As is evident, no single feature is sufficient for high-level event recognition; current research strongly suggests the joint use of multiple features, such as static frame-based features, spatio-temporal features, and acoustic features. However, whether BoW is the best model to obtain meaningful representations of a video remains an important open issue. Although this technique performs surprisingly well [24,58,96,98], the major drawback of systems conforming to this paradigm is their inability to obtain a deep semantic understanding of the videos, which is a prevalent issue in high-level event analysis. This is because they provide a compact representation of a complex event depicted in a video based on the underlying features, without having any understanding of the hierarchical
