First Person Action Recognition

Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science and Engineering

by

Suriya Singh
201307505
[email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA
November 2016

Copyright © Suriya Singh, 2016
All Rights Reserved

To Family and Friends

Acknowledgements

I would like to thank my advisers Dr. Chetan Arora and Prof. C. V. Jawahar for all the guidance and support. They have been a constant source of inspiration and their guidance has been critical to my development as a budding researcher in computer vision and machine learning. Thank you for pushing me beyond what I thought were my limits. I am able to graduate with a framework that will carry me throughout my career, which is something I am eternally grateful for. It was both an honour and a great privilege to work with them.

Working at CVIT was great fun. I was fortunate to have met brilliant students and wonderful friends: Aniket, Vijay, Koustav, Mallikarjun, Minesh, Priyam, Shushman, Jay, Vidyadhar, Yashashwi, Anand Sir, Jobin, Mohak, Ajeet, Praveen, Pritish, Devender and Viresh.

Many thanks to Prof. P. J. Narayan, Prof. Jayanthi and Dr. Anoop Namboodiri for their encouraging presence and for providing an environment conducive to research of the finest quality. I am grateful to Mr. R. S. Satyanarayana, Varun, Rajan and Nandini for all the support.

This journey would not have been possible without my friends: Naveen, Vijay Kant, Gaurav, Ankit, Nivedita, Abhinav and Shrey.

Finally, I would like to thank my family for supporting me always.

Abstract

Egocentric cameras are wearable cameras mounted on a person's head or shoulder. With their ability to capture what the wearer is seeing, such cameras are spawning a new set of exciting applications in computer vision. Recognising the activity of the wearer from an egocentric video is an important but challenging problem, and it is harder than third person activity recognition because the wearer's pose is unavailable. Unstructured camera movement caused by the natural head motion of the wearer produces sharp changes in the visual field of the egocentric camera, making the problem even more challenging and causing many standard third person action recognition techniques to perform poorly on such videos. On the other hand, the objects present in the scene and the hand gestures of the wearer are the most important cues for first person action recognition. However, such cues are difficult to segment and recognize in an egocentric video. Carefully crafted features based on hand and object cues have been shown to be successful, but only on limited, targeted datasets.

In the first part of our work, we propose a novel representation of first person actions derived from feature trajectories. The features are simple to compute using standard feature tracking and, unlike many previous approaches, do not require segmentation of hands and objects or recognition of object or hand pose. We train a bag of words classifier with the proposed features and report a significant performance improvement on publicly available datasets.

In the second part of the thesis, we propose convolutional neural networks (CNNs) for end to end learning and classification of the wearer's actions. The proposed network makes use of egocentric cues by capturing hand pose, head motion and a saliency map. The proposed network model is compact and can therefore be trained from the relatively small number of labeled videos available in egocentric settings. We show that the proposed network generalizes and gives state of the art performance on egocentric action datasets that differ widely from each other, both visually and dynamically.
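The pipeline sketched below illustrates, at a very high level, the trajectory-plus-bag-of-words approach summarised above. It is a minimal sketch, not the thesis implementation: it tracks sparse corners with Lucas-Kanade optical flow and describes each track only by its normalised displacements, whereas the thesis aligns HOG, HOF and MBH descriptors to dense trajectories and adds egocentric cues. The variables clip_paths and labels, the codebook size and all tracking parameters are illustrative assumptions.

# Minimal sketch (not the thesis implementation) of a trajectory + bag-of-words
# action classifier. Sparse corners are tracked with pyramidal Lucas-Kanade, each
# surviving track is described by its normalised frame-to-frame displacements,
# descriptors are quantised against a k-means codebook, and a linear SVM is trained
# on the resulting histograms. All parameter values are illustrative.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

TRACK_LEN = 15  # frames per trajectory (assumed fixed length)

def trajectory_descriptors(video_path, track_len=TRACK_LEN):
    """Return an (N, 2 * track_len) array of normalised point-track displacements."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    descriptors = []
    while ok:
        prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                      qualityLevel=0.01, minDistance=8)
        if pts is None:
            ok, frame = cap.read()
            continue
        tracks = [[p.ravel()] for p in pts]
        for _ in range(track_len):
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
            for tr, p, s in zip(tracks, nxt, status):
                if s[0] == 1:
                    tr.append(p.ravel())
            prev, pts = gray, nxt
        for tr in tracks:
            tr = np.float32(tr)
            if len(tr) < track_len + 1:
                continue  # drop tracks lost before reaching full length
            disp = np.diff(tr, axis=0)                 # frame-to-frame displacements
            norm = np.linalg.norm(disp, axis=1).sum() + 1e-6
            descriptors.append((disp / norm).ravel())  # scale-normalised track shape
        ok, frame = cap.read()
    cap.release()
    return np.array(descriptors)

def bow_histogram(desc, codebook):
    """Quantise a clip's trajectory descriptors into a bag-of-words histogram."""
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float32)
    return hist / (hist.sum() + 1e-6)

# Hypothetical usage: clip_paths and labels are placeholder lists of training videos
# and their action ids, not part of any released dataset interface.
# all_desc = [trajectory_descriptors(p) for p in clip_paths]
# codebook = KMeans(n_clusters=100, n_init=10).fit(np.vstack(all_desc))
# X = np.stack([bow_histogram(d, codebook) for d in all_desc])
# clf = LinearSVC().fit(X, labels)

In the thesis, this simple displacement descriptor is replaced by trajectory aligned HOG, HOF and MBH descriptors computed after camera motion compensation, but the overall quantise-and-classify structure is the same.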
Contents

1 Introduction   1
    1.1 First Person Actions   1
    1.2 Scope of this thesis   3
        1.2.1 Problem statement and Challenges   3
        1.2.2 Contributions   4
    1.3 Related Works   5
    1.4 Datasets and Annotations   7
    1.5 Evaluation Protocol   9
    1.6 Thesis Outline   9

2 First Person Action Recognition Using Trajectory Aligned Features   10
    2.1 Baseline: Dense Trajectories   11
    2.2 Motion Cues: Motion Boundary Histogram   12
    2.3 Action in Reverse: Bi-directional Trajectories   13
    2.4 Handling Wild Motion   13
    2.5 Fast and Slow Actions: Temporal Pyramids   14
    2.6 Kinematic and Statistical features   14
    2.7 Egocentric Cues: Camera Activity   15
    2.8 Experiments and Results   15
        2.8.1 Results on Different Datasets   15
        2.8.2 Failure Analysis   19
        2.8.3 Implementation Details and Runtime Analysis   20
    2.9 Semantically Meaningful Temporal Segmentation using Proposed Features   21
    2.10 Generic Action Recognition from Egocentric Videos   22
        2.10.1 Dominant Motion Feature   23
        2.10.2 Long term vs Short term action classification   24

3 First Person Action Recognition Using Deep Learned Descriptors   28
    3.1 Ego ConvNet   29
        3.1.1 Network Input   29
        3.1.2 Ego ConvNet Architecture   31
        3.1.3 Training Ego ConvNet   33
    3.2 Three-Stream Architecture for First Person Action Recognition   34
    3.3 Experiments and Results   35
    3.4 Qualitative Results and Error Visualisation   37
        3.4.1 Failure Analysis   39

4 Conclusions and Future Directions   41

Appendix A: Example images from the datasets   43
List of Figures

1.1 Egocentric cameras such as GoPro and Google Glass are typically worn on the head or along with eye glasses. These cameras capture the wearer's actions from the first person point of view and are widely used to capture extreme sports and life logging videos. Large camera shake due to natural head movement and the unavailability of the wearer's pose are the key challenges when dealing with such videos.   2

1.2 Examples of wearer's action categories from the different datasets we propose to recognize in this thesis: GTEA [17] (top row), Kitchen [52] (middle row) and ADL [42] (bottom row). The columns represent the actions 'pour', 'take', 'put', 'stir' and 'open'. The actions vary widely across datasets in terms of appearance and speed. The features and techniques we suggest in this thesis successfully recognize the wearer's actions across all the presented scenarios, showing the robustness of our method.   2

1.3 Sample frames from the 'Extreme Sports' dataset introduced by us. The figure shows examples of the 'jump' action in different sports categories: ski, jetski, mountain biking and parkour. Note the variations among the samples, which make the dataset extremely challenging for the action recognition task.   9

2.1 We propose to use the motion cues as well as the visual cues from object trajectories for first person action recognition. The first and second columns show the object and camera trajectories for the 'pour' and 'stir' actions. There is enough information in these cues to classify first person actions. Similar works in egocentric vision use complex image segmentation algorithms to arrive at the labeling of hands and handled objects.   11

2.2 Motion of the egocentric camera is due to 3D rotation of the wearer's head and can easily be compensated by a 2D homography transformation of the image. Left: optical flow overlaid on the frame. Right: compensated optical flow followed by thresholding. Almost all camera motion has been compensated by this simple technique.   13

2.3 3-level temporal pyramid for a coarse-to-fine BOW representation. Each block is a BOW vector of length 2k. We use the temporal pyramid for HOG and HOF features in our experiments.   14

2.4 Some failure cases of this approach. (a) 'shake' classified as 'stir' due to high visual and motion similarity; on the right, a similar frame with the 'stir' action classified correctly. (b) 'pour' classified as 'spread' due to the hand movement; notice the high similarity between 'pouring' mayonnaise and 'spreading' jam. On the right, a frame classified correctly as 'spread'. A large portion of the observed errors occur at action boundaries where the features from two actions merge. (c) Two frames at an action boundary: 'open' (left, predicted correctly) and 'BG' (right, predicted as 'open'). (d) On the left, 'fold' classified as 'pour' because very few samples of 'fold' are available in the dataset; the 'fold' action accounts for less than 0.5% of all the actions in the dataset and has only 82 frames for training and 54 frames for testing. On the right, a frame classified correctly as 'pour'. The same objects being present in the left and right images might have led to the confusion. We believe our method requires more examples of such scarce actions to distinguish between these cases.   18
2.5 Confusion matrix for our method on the GTEA dataset. We observe that many errors occur because the action boundary is not clearly defined. 'close' is commonly confused with 'open' due to the similarity in the nature of the two actions. Also, most actions occur before or after 'background', hence the common confusion. The high error for some classes (e.g., 'fold', 'shake', 'put') arises because very few samples of those actions are available in the dataset for training as well as testing.   19

2.6 Semantically meaningful temporal segmentation using the proposed features: error visualization on all test frames (7 videos) of the GTEA dataset. Each action label is color coded. We use an MRF based method for refining the predicted labels, assigning a penalty according to the difference between the global HOF histogram of a frame and those of its neighbors. Predicted action labels using the classifier score are shown in the top row, action labels after MRF based temporal segmentation in the middle row and ground truth action labels in the bottom row.   22

2.7 First person actions can be broadly divided into two categories: short term and long term actions. The top row shows short term actions (left: 'scoop', right: 'stir') from the GTEA [17] dataset and the trajectory aligned features used by [59] for action recognition. The bottom row shows long term actions (left: 'riding', right: 'driving') from the Egoseg [44] dataset and the motion feature used by [44] for action recognition. In recent works, the method and features are specific to one kind of action and do not work well for the other kind. The focus of this section is on recognizing the kind of action present in an egocentric video and identifying appropriate features as well as a method for further processing.   23

2.8 Motion of the egocentric camera is due to 3D rotation of the wearer's head and can easily be compensated by a 2D homography transformation of the image. Left: optical flow overlaid on the frame. Right: compensated optical flow followed by thresholding. Almost all camera motion has been compensated by this simple technique, and it is the compensated flow that proves more useful for identifying the type of action present in the video. The top row shows the short term action 'take', the bottom row the long term action 'walking'. (A brief code sketch of this compensation step follows this list.)   24

2.9 Example frames from the various datasets used for training our classifier with the DM feature. Top row: short term actions. Bottom row: long term actions.   26
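The homography based camera motion compensation referred to in Figures 2.2 and 2.8 can be sketched as follows. This is a minimal illustration under the captions' assumption that head rotation induces a global 2D homography between consecutive frames; the choice of flow algorithm, the RANSAC threshold and the magnitude threshold below are illustrative, not the settings used in the thesis.

# Minimal sketch of homography based compensation of egocentric camera motion
# (cf. Figures 2.2 and 2.8). Dense flow is computed between two frames, a global
# homography is fitted to sparse matches with RANSAC, the flow explained by that
# homography is subtracted, and the residual is thresholded.
import cv2
import numpy as np

def compensated_flow(prev_bgr, curr_bgr, mag_thresh=1.0):
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)

    # Observed dense optical flow between the two frames.
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Global homography from sparse feature matches; RANSAC down-weights
    # independently moving hands and objects.
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=500,
                                  qualityLevel=0.01, minDistance=8)
    if pts is None:
        return flow
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None)
    good = status.ravel() == 1
    H, _ = cv2.findHomography(pts[good], nxt[good], cv2.RANSAC, 3.0)
    if H is None:
        return flow

    # Flow predicted by pure camera motion: where every pixel would move under H.
    h, w = prev.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(grid, H).reshape(h, w, 2)
    camera_flow = warped - np.dstack([xs, ys])

    # Residual flow: observed minus camera-induced, small magnitudes suppressed.
    residual = flow - camera_flow
    residual[np.linalg.norm(residual, axis=2) < mag_thresh] = 0
    return residual

For short term actions the surviving residual is concentrated around the hands and manipulated objects, while long term actions such as walking involve translation that a single homography cannot explain, which is presumably why the compensated flow also helps to identify the type of action, as noted in the caption of Figure 2.8.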