Interestingly, the computer vision community has only recently started to explore this new domain of egocentric vision, where research can roughly be categorized into three areas: object recognition, activity detection/recognition, and video summarization. In this paper, we try to give a broad overview of the different problems that have been addressed and collect and compare evaluation results. Moreover, along with the emergence of this new domain came the introduction of numerous new and versatile benchmark datasets, which we summarize and compare as well.

1 Introduction

Most of the classic work in computer vision has been devoted to studying either static images or video from stationary cameras (such as tracking objects in surveillance applications). Recently, technological advances have made lightweight, wearable, egocentric cameras both practical and popular in various fields.

sual attention in naturalistic environments. Most recently, emerging products like Google Glass started making first attempts to bring the idea of wearable, egocentric cameras into the mainstream.

From a computer vision standpoint, videos from these first-person devices pose a lot of challenges. Because the camera is constantly moving, the motion is highly non-linear and unpredictable. As a result, objects may rapidly disappear and reappear in the field of view. In extreme cases (such as sport videos), one must also expect things like motion blur, splashing water or glare. On the other hand, some qualities of egocentric video may be helpful for specific applications. For example, objects that the observer manipulates, or people and faces that the observer interacts with, tend to naturally be centered in the view and are less likely to be occluded than they might be if captured from a static, third-person camera.

In the next section, we will introduce the most recent work from the computer vision community in the domain of egocentric video. We further try to point out both
The GoPro camera for instance can be mounted to helmets and is pop-

egocentric-specific challenges that occurred within the given problems, but also mention situations where the egocentric paradigm was actually useful. We emphasize that egocentric video is an emerging field and a lot of the work that we will reference can be considered as pioneering work. As a result, not many things are built on top of each other and direct quantitative comparisons between different works are often difficult.

Another effect of the novel nature of the egocentric domain is the emergence of numerous new and very versatile datasets. While briefly explaining the individual datasets along with the work in section 2, we give a detailed overview of publicly available datasets in section 3.

In section 4, we summarize and compare results from the previous sections and finally section 5 concludes the paper.

2 Recent Work

In this section, we introduce recent work in the field of egocentric video. We group this work into three categories.

objects (milk carton, watering can, etc.), where each object was being manipulated by hands in an object-specific way. To obtain some baseline results for their dataset, they annotated a small subset of frames with ground-truth object versus background segmentations. They used a standard SIFT based recognition system described in [2] and trained a multi-class SVM. They achieved a 12% recognition rate compared to a random chance of 2.4%. They went on to quantify the influence of various egocentric-specific challenges, such as limited texture of objects, background clutter and hand occlusion. To gain an upper bound for recognition performance, they used the SIFT recognition system on clean exemplar images of their objects, obtaining an average accuracy of 63.7%. Simulating occlusion on the clean exemplars reduced the accuracy to 57.0%, simulating background clutter resulted in a 20% drop in accuracy, and combining both reduced the accuracy to 30.3%. They suggest motion and location priors as well as hand detection as future research directions.
The first category deals with object recognition with respect to objects that are being manipulated (by hand) by the first-person observer. The second category deals with the detection and recognition of first-person actions and activities. We will see that this category naturally emerges from the first one, as most of the considered activities are characterized by the objects being used. The third category deals with so-called “life logging” video data. This data is mainly characterized by the fact that it involves hour-long, continuous video data depicting the “life” of the first-person observer. Work in this area usually deals with data summarization, i.e. the extraction of relevant or representative frames or actions. However, there is also work in more specific tasks such as the detection of social interactions based on egocentric video recorded by a group of people in a theme park.

Follow-up work has been done by Ren and Gu [3] who developed a motion-based approach to segment out foreground objects in egocentric video in order to improve object recognition accuracy. The idea is based on the observation that there are some regularities with respect to motion in egocentric video that are useful towards motion segmentation: During object manipulation, hands and objects have the tendency to appear near the center of the view and body (i.e. camera) motions are rather small and horizontal. Their model explicitly addresses this with a motion prior and a location prior for each pixel. The distribution for the location prior is built by averaging ground truth segmentation masks and the motion prior is based on optical-flow results obtained from video parts that only contain background (no hands or objects), thus giving an average flow estimation for each background pixel. Additionally, they used temporal cues that take segmentation masks from previous frames into account.
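To make the location prior of Ren and Gu [3] more concrete, the following is a minimal Python sketch of the averaging step only. It is an illustration under assumptions, not the authors' implementation: masks are represented as small nested lists with 1 for foreground, and the name location_prior is hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary figure-ground mask, 1 = foreground (assumed encoding)


def location_prior(masks: List[Mask]) -> List[List[float]]:
    """Average binary ground-truth masks pixel by pixel.

    The result at (y, x) is the fraction of masks in which that pixel
    was labeled foreground, i.e. an empirical foreground probability.
    """
    if not masks:
        raise ValueError("need at least one mask")
    height, width = len(masks[0]), len(masks[0][0])
    prior = [[0.0] * width for _ in range(height)]
    for mask in masks:
        for y in range(height):
            for x in range(width):
                prior[y][x] += mask[y][x]
    count = float(len(masks))
    return [[value / count for value in row] for row in prior]


# Two toy 2x3 masks; the center column is foreground most often.
masks = [
    [[0, 1, 0], [0, 1, 0]],
    [[0, 1, 1], [0, 0, 0]],
]
prior = location_prior(masks)
```

Pixels that are foreground in every mask receive a prior of 1.0, which mirrors the observation that manipulated objects tend to appear near the center of the view. The motion prior could be built analogously by averaging flow vectors over background-only frames.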
2.1 Object Recognition

One of the first analyses of object recognition in egocentric video was done by Ren and Philipose [1]. Motivated by the idea that recognizing handled objects can provide essential information about a person’s activity, they wanted to explore the challenges and characteristics of object recognition in the context of egocentric video. They collected a video dataset consisting of 42 everyday

Finally, they used the coarse-to-fine variational optical flow algorithm of [4] to create dense optical flow across two frames and then used RANSAC to fit the motion vectors into affine layers. Equipped with these motion features and priors, they trained a max-margin classifier for pixelwise figure-ground classification and cleaned up the results using the standard GraphCut algorithm. For testing, they used the same 42 object dataset as [1] and improved the accuracy of the SIFT based recognition system from 12% to 20%. They also tested a latent HOG based recognition system [5] and found that the accuracy improved from 38% to 46%.

Fathi et al. [6] took advantage of the egocentric paradigm (objects of interest tend to be centered and at a large scale) to learn object classification and segmentation with very weak supervision. The motivating idea to use object recognition as a way to make inference on possible activities is similar to that of [1], but is taken a step further in the sense that they explored egocentric activities involving multiple objects (such as making a peanut butter and jelly sandwich). They hypothesized that

knife” or “open the fridge”, while an activity describes a more complex aggregation of actions such as making coffee.

2.2.1 Early Work Using Gist

Early work in the domain of both unsupervised action segmentation and supervised action classification was done by Spriggs et al. [8]. They introduced the “CMU kitchen” dataset that contains multimodal measures, including egocentric video, of people cooking different recipes (brownies, pizza, etc.) in a kitchen environment. Each frame was labeled with an action class (such as “stirring”).
the co-occurrence of different objects within those activities can be exploited for object detection and localization. They performed figure-ground segmentation as well, but their approach differed from [3] as it allowed objects to become part of the background after being manipulated. This is accomplished by splitting the video into short intervals and creating a local background model for each. For the weakly supervised learning, they collected a dataset of 7 daily activities involving multiple objects (making coffee/tea/sandwiches). Each video was only labeled with the list of objects it contained. To learn an appearance model for each object type, they used the diverse density based multiple instance learning framework of [7]. They further used equality constraints to assign the same label to regions with significant temporal connections. The object recognition accuracy ranged from about 10% (sugar) to about 95% (coffee). Additionally, their figure-background segmentation approach outperformed

For action segmentation, rather than trying to recognize objects like most of the follow-up work, they computed the gist [9] of each frame. The assumption is that, under the egocentric paradigm, specific actions are performed in front of a somewhat constant background, making a gist feature vector a reasonable approach to model each frame. They performed PCA to reduce the vector dimensionality and estimated different Gaussian mixture models to investigate whether these features cluster into similar scenes. For some activities, such as “stirring”, they saw promising results (70% of frames labeled with this action were assigned to the right cluster) but noted that results do not generalize well as model parameters need to be varied to capture distinct sets of actions. They also explored supervised action classification by training an HMM with a mixture of Gaussians output on the gist features and obtained an average classification accuracy of 9.38% (chance being 3%).
[3] on the 42 object dataset, having a 48% segmentation error rate as opposed to 67%.

2.2 Activity and Action Detection

Many authors recognized that a lot of activities that are interesting from an egocentric perspective are characterized by the observer manipulating objects in front of him. This is very different from third-person videos where objects might be hard to see and thus, people focused on activities that can be modeled by different body movements (e.g. dancing). In this section, we will use the terminology that has been established in recent work on egocentric activity and action detection, which is that actions describe simple, straightforward things such as “take the

Lastly, they applied a simple KNN model, where each test frame from one subject is given the label of the frame with the smallest Euclidean distance from the set of frames of all other subjects, reaching a classification accuracy of 48.64%.

2.2.2 Object-based Activity Detection

Further research on activity detection was done by Pirsiavash and Ramanan [10], whose work stands out due to their large, versatile and fully labeled dataset. They captured 18 daily indoor activities such as brushing teeth, washing dishes, or watching television, each performed by 20 different subjects in their respective apartments. 42 different object classes involved in these activities were annotated with bounding boxes. Each object also had a label depicting whether it is currently active (in hands) or not. Also driven by the idea that activities are all about the objects being involved, they used their data to build an activity model that explicitly models object use over time. For every frame of a given activity, they used the

which demonstrate that during object manipulation tasks a substantial percentage of gaze fixations fall upon task-relevant objects. They used a generative model to describe the relationship between egocentric action and gaze location.
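The nearest-neighbor classifier reported for Spriggs et al. [8] (each test frame receives the label of the closest frame among all other subjects) can be sketched as follows. This is a minimal sketch, assuming each frame is already reduced to a short feature vector. The tuple layout and the name nearest_neighbor_label are hypothetical, and the toy 2-D vectors stand in for gist features.

```python
import math
from typing import List, Sequence, Tuple

# (subject id, feature vector, action label); this layout is an assumption
Frame = Tuple[str, Sequence[float], str]


def nearest_neighbor_label(
    query: Sequence[float], query_subject: str, frames: List[Frame]
) -> str:
    """Label a query frame with the label of the closest frame
    (Euclidean distance) recorded by any *other* subject."""
    best_label, best_dist = None, math.inf
    for subject, feature, label in frames:
        if subject == query_subject:
            continue  # leave out the test subject's own frames
        dist = math.dist(query, feature)
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_label is None:
        raise ValueError("no frames from other subjects available")
    return best_label


frames = [
    ("s1", (0.0, 0.0), "stir"),
    ("s1", (1.0, 1.0), "pour"),
    ("s2", (0.1, 0.0), "stir"),
    ("s3", (0.9, 1.0), "pour"),
]
predicted = nearest_neighbor_label((0.2, 0.1), "s1", frames)
```

Excluding the query subject's own frames is what makes this a leave-one-subject-out evaluation: a frame can only be matched to footage of other people performing the same action.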
part-based object model by [11] to record a score based on the most likely position and scale for each of their 42 object classes. Averaging this score over all activity frames yielded a histogram of object scores for a specific activity. They went on to temporally split the video into halves in a pyramid fashion, each time calculating the object score histogram, and thus ending up with an activity model that describes object use over time. They learned a linear SVM on these models. Trained with all objects, they achieved a 32.6% activity classification accuracy (chance being 5.6%) and trained with only active objects they achieved 40.6% accuracy.

An alternative, unsupervised activity model was proposed by Fathi et al. [12]. Continuing their own work on object recognition in egocentric video [6], they proposed a graph based model that takes advantage of the semantic relationship between activities, actions and objects. They worked on the same dataset as they did

This means they learned the probability of transitioning to a gaze location g_t, given g_{t-1} and the current action a, as well as the likelihood of an image feature x_t, given the current action a and the gaze position g_t. The image features were based on object features described in their earlier work [12], as well as appearance features and future manipulation features. The appearance features were used to describe the fixated part of an object and were based on color and texture histograms in a circular area around the gaze location. Future manipulation features were aimed to take advantage of the fact that gaze is usually a split second ahead of the hands, so knowing the hand location a few frames ahead provides a cue of the gaze location in the current frame. They used a new dataset involving different kinds of meal preparations similar to their previous work [6] but extended by the gaze data. They found that incorporating gaze information improved the action recognition accuracy to 47% compared to 27% when using the method of [12].
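The temporal pyramid of object score histograms used by Pirsiavash and Ramanan [10] can be illustrated with a short Python sketch. It is a simplified illustration under assumptions, not their code: per-frame detector scores are given as plain lists (one score per object class), and the names mean_scores and temporal_pyramid are hypothetical.

```python
from typing import List

FrameScores = List[float]  # one detector score per object class (assumed input)


def mean_scores(frames: List[FrameScores]) -> List[float]:
    """Average the per-class scores over a set of frames."""
    n_classes = len(frames[0])
    return [sum(f[c] for f in frames) / len(frames) for c in range(n_classes)]


def temporal_pyramid(frames: List[FrameScores], levels: int = 2) -> List[float]:
    """Concatenate averaged score histograms over 1, 2, 4, ... temporal segments."""
    if len(frames) < 2 ** (levels - 1):
        raise ValueError("too few frames for the requested number of levels")
    feature: List[float] = []
    for level in range(levels):
        segments = 2 ** level
        for s in range(segments):
            start = s * len(frames) // segments
            end = (s + 1) * len(frames) // segments
            feature.extend(mean_scores(frames[start:end]))
    return feature


# Object 0 is used in the first half of the video, object 1 in the second half.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
feature = temporal_pyramid(frames, levels=2)
```

The level-0 histogram alone cannot tell in which order the two objects were used, whereas the level-1 halves can. The resulting vector is what a linear SVM would be trained on in this setup.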
They in [6], which contains activities such as making various also found promising results when predicting gaze loca- kindsof sandwiches. Based on detected objects, object- tions given the action. However, when inferencing both handinteractionsanda set ofaction labels(“spreadbut- actionandgazelocationactionrecognitionaccuracyonly ter on bread”, etc.) they used an approach similar to improvesmarginally(29%). Expectation-Conditional Maximization [13] to learn ac- tionsandthenlearnactivitiesfromactions. Then,thein- 2.2.3 State-basedActivityDetection ferredactivitylabelwasfixedandusedtoenhanceaction recognitionresults,astheactivitycanlimitthesetofpos- Very recently, Fathi et al. proposed a new approach sible actions as well as enforce a certain order. Finally, to model actions in egocentric videos [16], exploit- theyenhancedtheirinitialobjectrecognitionbylearning ing the fact that goal-oriented actions (“open coffee aprobabilisticobjectmodelthatincorporatestheinferred jar”) within object-manipulation activities (making cof- action priors. They recognized 6 out of 7 activities cor- fee/sandwiches) can be detected by state changes of the rectlyandtheiractionrecognitionaccuracywasat32.4% objectsbeinginvolved. Thus,fortrainingpurposes,they (chancebeing1.6%). Theyalso showedthatthis frame- annotatedeachactionwithstartframe,endframe,action workindeedimprovedtheirinitialobjectrecognitionper- labelaswellasasetofnounsdescribingtheobjectsbeing formance, achieving better results for almost all object involved. Focussingonlyonforegroundobjects[6],they classes. discovered regionsthat changedbefore and after the ac- Fathietal. extendedtheirworkin[14]byadditionally tionandclusteredthemintoregionsthatconstantlyappear consideringeyegaze,usingcalibrated,head-mountedeye duringtheactiontopruneoutirrelevantregions(suchas trackers in combination with egocentric cameras. 
They hands).Theythendescribedthoseregionswithcolor,tex- raised the question whether knowing the fixation loca- tureandshapefeaturesandtrainedalinearSVMtolearn tions helps to better recognize actions and vice versa. astate-specificregiondetector.Theactionitselfwasthen Thisapproachismotivatedbypsychologicalstudies[15] describedasa quantizedresponseofstartandendframe 4 to each region detector. With those responses, a second performed a DFT on the optical flow amplitudes to ob- linearSVMwastrainedtobuildanactiondetector. They tainfrequencyhistograms. TheyusedaDirichletmixture validated their model in terms of action recognition and model[20]tofirstinferamotioncodebookandtheninfer activity segmentation, achieving a 39.7% action recog- ego-actioncategories. Theyevaluatedtheir performance nition accuracy (based on 61 action classes) and outper- on both controlled, choreographed videos as well real- formedtheirpreviousworkin[12]. Theyachieveda33% world sport videos obtained from YouTube and reported accuracyforactivitysegmentation,basedonthepercent- an F-measure(consideringbothprecisionand recall) for age of test video frames that had been labeled with the each sport. They achieved an F-measure of 0.93 for the correctaction. choreographedvideosandandaverageF-measureof0.6 for the sport videos. Ego-actions varied between sports and involved labels such as “hop down”, “turn left” or 2.2.4 InteractionandSportActivities “wedgeleft”forskiing. Ryoo and Matthies recently were the first to explore interaction-levelhumanactivitiesfromafirst-personview 2.3 LifeLogging Video [17]. Motivated by surveillance, military or general human-robot interaction scenarios, they constructed a Anotherareathatisparticularlyofinterestintheubiqui- dataset of humans directly interacting with the egocen- touscomputingcommunityandcontainsegocentricvideo tric observer. Interactions varied from friendly (shaking is the idea of “life logging”. 
hands or petting the observer) to hostile (punching the observer or throwing objects at the observer). Based on the idea that interaction with the observer causes a lot of ego-motion, they used a combination of global and local motion descriptors to depict different activities. For global motion, they applied a conventional pixel-wise optical flow algorithm and built a histogram based on location and directions of the flow. For local motion, they interpreted the video as a 3-D XYT volume by concatenating frames over time and applied the cuboid feature detector by [18] to obtain video patches that contain salient motion. These motion descriptors were clustered using k-means to obtain a set of visual words. They represented an activity video as a histogram of these words and finally trained an SVM. Results were evaluated in terms of activity classification and detection, achieving an 89.6% classification accuracy (based on 7 different activities), as well as an average detection precision of 0.709.

Kitani et al. [19] observed the increased usage of egocentric cameras in sport videos (biking, skiing, etc.). They

Here, a first-person camera continuously records a whole day of its wearer’s life. The overall motivation that is mentioned by a lot of authors is to eventually develop systems that can serve as a retrospective memory aid for people with memory loss problems [21]. Thus, a common goal is to summarize long, egocentric video or detect novel, anomalous events.

2.3.1 Video Summarization

Doherty et al. [22] were among the first to investigate keyframe selection methods in the egocentric domain by looking at the Microsoft SenseCam, a camera worn around the neck that takes an image every couple of seconds (an average of 1,900 images a day) to create a passively captured, visual life log. They pointed out that a lot of the established mechanisms for keyframe selection do not translate directly to the domain of life logging video, as they, for instance, rely on motion analysis and, due to the very low frame rate of their camera, motion is virtually non-existent.
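The visual-word representation used by Ryoo and Matthies [17] can be sketched as follows. This is a minimal sketch under assumptions: the cluster centers are taken as given (in their work they come from k-means over motion descriptors), the toy descriptors are 2-D, and the name bag_of_words is hypothetical.

```python
import math
from typing import List, Sequence


def bag_of_words(
    descriptors: List[Sequence[float]], centers: List[Sequence[float]]
) -> List[float]:
    """Assign each descriptor to its nearest center (visual word) and
    return the normalized histogram of word counts."""
    counts = [0] * len(centers)
    for descriptor in descriptors:
        nearest = min(
            range(len(centers)), key=lambda i: math.dist(descriptor, centers[i])
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(centers)
    return [c / total for c in counts]


centers = [(0.0, 0.0), (1.0, 1.0)]  # assumed output of a prior k-means step
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.8, 0.9)]
histogram = bag_of_words(descriptors, centers)
```

Each activity video is thereby reduced to a fixed-length vector regardless of its duration, which is what makes it usable as input to a standard classifier such as an SVM.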
developed a fast, unsupervised approach to index videos into different ego-actions that is supposed to help the athlete to locate and review specific parts without the burden of manual search. Similar to [17], they leveraged the fact that first-person sport videos contain lots of ego-motion and used optical flow histograms to describe the motions of a specific sport video. As a lot of the sport activities contain periodic movements, they additionally

Also, passive capture devices may not always capture high quality images and hands or clothing covering parts of the lens are quite common. First, the authors split the set of images into different events where event boundaries are determined by high dissimilarity between frames according to a distance metric based on color and edge descriptors. They compared and investigated various approaches to select a keyframe for each of those events. Approaches varied from very simple solutions such as taking the middle image of the event, through taking the image that is closest to the average value of all images in the event, to more complex solutions like the image that is closest to the event average, farthest from the average of other events and performs well on various image quality tests for sharpness and contrast. Over 13,000 keyframes were judged by user ratings, where the most complex approach had an 8.4% higher score than the baseline (middle frame). They found that issues mainly occur during events that include a lot of motion (such as walking home) as there may be vast differences between images of

among events), finding that their method was judged better 68.75% of the time.

Lu and Grauman [26] extended this work by developing a story-driven (rather than object-driven) approach to summarize egocentric life logging video. The idea is to devise an influence metric that captures event connectivity and accounts for how one event leads to another, in order to create a summary that provides a better sense of a story.
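Two of the simpler keyframe selection strategies compared by Doherty et al. [22] (the middle image of an event, and the image closest to the event average) can be sketched in Python. This is an illustration under assumptions: images are represented by short feature vectors rather than color and edge descriptors, and the names middle_keyframe and mean_keyframe are hypothetical.

```python
import math
from typing import List, Sequence


def middle_keyframe(event: List[Sequence[float]]) -> int:
    """Baseline: index of the temporally middle image of the event."""
    return len(event) // 2


def mean_keyframe(event: List[Sequence[float]]) -> int:
    """Index of the image whose feature vector is closest to the event average."""
    dim = len(event[0])
    mean = [sum(image[d] for image in event) / len(event) for d in range(dim)]
    return min(range(len(event)), key=lambda i: math.dist(event[i], mean))


# One event of five images; the last image is an outlier (e.g. a covered lens).
event = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9), (5.0, 5.0)]
baseline_index = middle_keyframe(event)
average_index = mean_keyframe(event)
```

The more complex variants described above would add further terms to the selection criterion, such as distance to the averages of other events and image quality scores.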
the same event due to the nature of the camera and its low frame rate.

Lee et al. devised a method that aims to summarize life logging video material and goes beyond common keyframe detection by focusing on “importance cues” specific to the egocentric domain, such as objects and people the camera wearer interacts with [23]. In particular, they segment each frame into multiple regions using a constrained parametric min-cuts method [24] and learn a regressor that predicts an importance score for each region. The score is based on a combination of various features: interaction (Euclidean distance of region centroid to hand centroid, where the hand is detected based on skin color), gaze (Euclidean distance to center), frequency (appearance of region over multiple frames based on DoG+SIFT descriptors), object-like appearance (based on a ranking function of [24]), object-like motion, and likelihood of a person’s face within a region (using the Viola-Jones method [25]).

They also introduced a novel temporal segmentation method to cluster the video material into different events, which was specifically designed for egocentric video. They found that the method based on changes in color histograms which they used in previous work [23] does not really work well for egocentric video due to its continuous nature. Instead, they tried to distinguish whether the camera wearer is static, in transit (physically traveling from one point to another), or moving the head. They learned an SVM to predict these scenarios based on dense optical flow features and blurriness scores [27]. They found that this method produced events (i.e. sets of frames) of an average length of 15 seconds. They represented each event in terms of detected objects. For known environments, objects were represented as scores based on a bank of object detectors and for uncontrolled environments, objects were essentially visual words based on object-like windows [28]. They went on to consider each event as a node in a chain. Finding a story-driven
They ended up temporally clustering the video into different events based on color histogram differences and represented each event with the frame that has the highest importance score based on the regressor. For training and evaluation, they used Amazon’s Mechanical Turk to manually label and segment important regions in their video data, which consisted of multiple hours of daily life activities among four different subjects. They evaluated the performance on classifying important regions correctly (by thresholding the regressor), as well as the quality of the keyframe summary. They found that their method performed better in predicting important objects than object-like features alone or low-level saliency methods. To quantify the perceived quality of the keyframe summaries, they asked the subjects that wore the camera to compare their method with baseline methods (such as uniform sampling

summary consisting of K frames then comes down to finding the optimal, order-preserving K-node subchain with respect to story, importance and diversity constraints. Basically, the importance score was estimated similarly to their previous work [23], the story constraint favored event pairs with similar object instances, and the diversity constraint made sure that sequential events are not too similar. They found a good chain with the approximate best-first search strategy described in [29]. They evaluated their performance in the form of a user study based on their own dataset [23] as well as the “Activities of Daily Living” dataset from [10]. To do so, they had 34 subjects compare their approach with other techniques such as uniform sampling or their previous work [23]. They found that an average of 87% of the subjects preferred their approach among different datasets and baselines.

2.3.2 Novelty Detection

Aghazadeh et al. [30] looked at videos from a subject who

event is viewed as memorable. The idea is that different kinds of social interactions can be detected/recognized by faces and their spatial attention.
For instance, a monologue should have multiple observing faces attending the talking face. To model this, they first computed the orientation of each detected face using the Pittpatt face detection software¹ and then used the camera’s intrinsic parameters, as well as prior knowledge of face sizes at certain distances in order to estimate face locations and orientations in 3D. To get an estimate of the locations that the faces are attending, they built an MRF that incorporates these 3D locations/orientations as unary potentials, but also uses pairwise potentials between faces that bias nearby faces towards looking at the same location in the scene. They used an α-expansion method to optimize the MRF. Having an estimate for each face’s attention, they assigned roles to individual faces based on features such as the number of faces looking at x. Based on those, they could classify an interaction as dialogue, discussion, monologue and other labels, using a Hidden Conditional Random Field [33] that also incorporated temporal information. They reported results for both attention estimation as well as social interaction detection and recognition. Based on about 1000 hand-labeled frames, their method correctly estimated who is looking at whom in 71.4% of the cases. For detection, they presented ROC curves for different forms of interaction, where the average area under the curve is 0.88. The average recognition accuracy was 55% (chance being 20%).

3 Datasets

Figure 1 gives a compact overview over all datasets from the work mentioned in section 2 that are publicly available. We briefly describe the data as well as what kind of labeling is provided and also list the URLs to websites that contain further explanations and download links.

Most authors try to establish their own dataset and consequently none of the datasets has taken over the role of a true benchmark dataset. An exception might be the “Intel 42 Objects” dataset for the task of egocentric

recorded his one-hour commute to work multiple times, wearing an egocentric camera that captures one image per second. Motivated by the idea to use life logging cameras as a memory support system for the disabled [21], they proposed a method of novelty detection, where a novel event might be “meeting a friend” during the otherwise similar sequences of the subject going to work. They achieved this by exploiting the invariant temporal order of the activities across the different sequences to automatically align a query sequence with the other sequences. The idea is that a bad alignment yields a novelty in the query action as it is likely caused by an event that has not been observed in the reference sequences. They derived a similarity measure between two frames based on VLAD (vector of locally aggregated descriptors, proposed by [31]) as well as geometric similarities, represented by the epipolar geometry between the two frames (i.e. the fundamental matrix). Comparing each frame from the query sequence with each frame from a reference sequence creates a cost matrix whose minimum cost path connecting the first and last frame (with the constraint that matches have to occur in temporal order) yields the best alignment between the two sequences. Finally, if a frame from the query sequence has a minimum match cost among all reference sequences that is above some threshold, it is considered a novelty. From 31 sequences of the subject going to work, four of them contained an event that the authors considered novel and all of them were detected by the algorithm.

2.3.3 Social Interactions

Fathi et al. [32] looked at egocentric life logging video for social events, in particular people spending a day at an amusement park, and developed a method for the detection and recognition of social interactions. This was motivated by the idea that typically, one or more individuals have to play the role of the “group videographer” to capture memorable events, which prevents them from fully participating in the group experience.
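The alignment-based novelty detection of Aghazadeh et al. [30] can be illustrated with a small dynamic-programming sketch. It is a simplified illustration under assumptions and not their implementation: frames are single numbers, the frame cost is an absolute difference (they combine VLAD and epipolar geometry), the path may move one step in either or both sequences, and the names align and novel_frames are hypothetical.

```python
from typing import List, Sequence

INF = float("inf")


def align(query: Sequence[float], reference: Sequence[float]) -> List[float]:
    """For each query frame, return the cost of its cheapest match on the
    minimum-cost, temporally ordered path through the cost matrix."""
    n, m = len(query), len(reference)
    cost = [[abs(q - r) for r in reference] for q in query]
    acc = [[INF] * m for _ in range(n)]  # accumulated path cost
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else INF,
                acc[i][j - 1] if j > 0 else INF,
                acc[i - 1][j - 1] if i > 0 and j > 0 else INF,
            )
            acc[i][j] = cost[i][j] + best_prev
    # Backtrack from the last to the first frame pair along the optimal path.
    matched = [INF] * n
    i, j = n - 1, m - 1
    while True:
        matched[i] = min(matched[i], cost[i][j])
        if i == 0 and j == 0:
            break
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
    return matched


def novel_frames(
    query: Sequence[float], references: List[Sequence[float]], threshold: float
) -> List[int]:
    """Query frames whose best match over all references exceeds the threshold."""
    per_reference = [align(query, ref) for ref in references]
    return [
        t
        for t in range(len(query))
        if min(matches[t] for matches in per_reference) > threshold
    ]


references = [[0, 1, 2, 3, 4], [0, 1, 2, 2, 3, 4]]
query = [0, 1, 9, 3, 4]  # frame 2 shows something never seen on the commute
novel = novel_frames(query, references, threshold=1.0)
```

Because the path must respect temporal order, a frame that merely resembles some unrelated moment of a reference sequence cannot be matched to it cheaply, which is the property the method relies on.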
Moreover, a lot of memorable moments may occur spontaneously and the authors’ thesis is that the presence or absence of social interactions is an important cue as to whether an

object recognition, which has also been used by [6] to test the performance of their motion-based foreground-background segmentation method. Further, the “Activities of Daily Living” dataset was used by [26] to test their story-driven video summarization method. However, as this dataset was primarily collected for the task of activity recognition [10], a direct comparison between both works was not possible.

Figure 1: Overview of publicly available egocentric video datasets. Row one deals with object recognition. Rows 2-5 deal with activity detection/recognition. Rows 6 and 7 deal with life logging video data.

Name: Intel 42 Objects
Description: 10 video sequences (100K frames) from two human subjects manipulating 42 everyday object instances such as coffee pots, sponges, or cameras
Labeling: each frame labeled with name of object; exemplar photos of objects with foreground/background segmentation
Used in: [1, 3, 6]
URL: http://seattle.intel-research.net/

Name: Georgia Tech Egocentric Activities (GTEA)
Description: 7 types of daily activities such as making a sandwich/coffee/tea; each performed by 4 different subjects
Labeling: each activity video is labeled with list of objects being involved; each frame has left hand, right hand, and background segmentation masks
Used in: [6, 12, 16]
URL: http://www.cc.gatech.edu/~afathi3/GTEA/

Name: CMU kitchen
Description: multimodal dataset of 18 subjects cooking 5 different recipes (brownies, pizza, etc.); also contains audio, body motion capture, and IMU data
Labeling: each frame is labeled with an action such as “take oil”, “crack egg”, etc.
Used in: [8]
URL: http://kitchen.cs.cmu.edu/

Name: Activities of Daily Living
Description: 18 daily indoor activities such as brushing teeth, washing dishes, or watching television, each performed by 20 different subjects
Labeling: 42 object classes that are involved in the activities are annotated with bounding boxes in all frames
Used in: [10, 26]
URL: http://deepthought.ics.uci.edu/ADLdataset/ad

Name: Georgia Tech Egocentric Activities - Gaze+
Description: 7 types of meal preparation such as making pizza/pasta/salad; each performed by 5 different subjects
Labeling: each frame has eye gaze fixation data; time frames of different activities such as “open fridge” are annotated
Used in: [14]
URL: http://www.cc.gatech.edu/~afathi3/GTEA_Gaze_Website/

Name: UT Egocentric
Description: 4 videos from head-mounted cameras capturing a person’s day, each about 3-5 hours long
Labeling: not available
Used in: [23, 26]
URL: http://vision.cs.utexas.edu/projects/egocent

Name: First-Person Social Interactions
Description: day-long videos of 8 subjects spending their day at Disney World
Labeling: time frames for different activities (“waiting”, “train ride”, etc.) and social interactions (dialogue, discussion, etc.) are annotated
Used in: [32]
URL: http://www.cc.gatech.edu/~afathi3/Disney/

¹ Pittpatt has since been acquired by Google Inc. and the software is not publicly available anymore.

ter methods for keyframe extraction and summarization of egocentric life logging video. In contrast, Aghazadeh et al. looked at life logging video of one subject over multiple days and detected novel or out of the ordinary activities.
4 Summary and Comparison

In this section, we summarize the key aspects of the work that was introduced in the previous sections and draw comparisons where possible.

Ren and Philipose [1] were the first to test standard recognition systems for the task of recognizing handled objects in egocentric video. They continued to find that foreground-background segmentation can successfully be done with optical flow based approaches and helps to improve the recognition results, as handled objects tend to be in the foreground [3]. Their segmentation method was improved by Fathi et al. [6], who were also the first to consider multiple objects being manipulated as part of kitchen activities like making sandwiches. Fathi et al. went on to experiment with various weakly supervised approaches to recognize such activities, including object co-occurrence and changes in object states [12, 16]. They are also the only group to experiment with the influence of gaze with respect to activity recognition [14]. Pirsiavash and Ramanan [10] were successful at recognizing more versatile household activities. However, unlike the work of Fathi et al., their method is strongly supervised. Ryoo and Matthies started looking at interaction-level activities such as shaking hands [17]. They discovered that activities that contain a lot of ego-motion can be well described with optical flow based approaches. Kitani et al. [19] came to similar conclusions when looking at sport activities that also involve a lot of ego-motion.

In parallel, researchers started looking at egocentric video for life logging purposes. Doherty et al.

5 Conclusion

In the previous sections, we gave a broad overview regarding the different problems in the domain of egocentric video that have recently been addressed in the computer vision community. We showed that research could roughly be grouped into three categories: object recognition, activity and action detection, and life logging video summarization. All work in this domain is at a very early stage: The first publications on egocentric object recognition [1] and action segmentation [8] date back to the first (out of two) IEEE workshop on egocentric vision during CVPR 2009. Early work on egocentric video in life logging scenarios only dates back to 2008 [22]. As one result of this, almost all publications introduce their own, novel datasets while working with other authors’ data remains the exception. Consequently, no dominant benchmark datasets have emerged so far like they have in other computer vision areas such as general object recognition.

Despite the novel nature of the egocentric vision domain, we can see some trends that span across all research categories: Egocentric video is all about objects. In first-person videos, objects of interest tend to be naturally centered and at a large scale while being subject to relatively little occlusion, which makes egocentric video very convenient for object detection and classification. Additionally, optical flow based methods seem to work very well for the task of segmenting foreground objects (that are manipulated by hands) from background noise and are used in almost all recent publications to improve recognition results. This object-centered idea expands to action and activity recognition. Traditional work in this area