ebook img

Complex Event Recognition from Images with Few Training Examples PDF

2.7 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Complex Event Recognition from Images with Few Training Examples

Complex Event Recognition from Images with Few Training Examples UnaizaAhsan∗ ChenSun∗∗ JamesHays∗ IrfanEssa∗ [email protected] [email protected] [email protected] [email protected] *GeorgiaInstituteofTechnology **UniversityofSouthernCalifornia1 7 1 0 Abstract 2 n Weproposetoleverageconcept-levelrepresentationsfor a complex event recognition in photographs given limited J trainingexamples. Weintroduceanovelframeworktodis- 7 covereventconceptattributesfromthewebandusethatto 1 extractsemanticfeaturesfromimagesandclassifytheminto ] socialeventcategorieswithfewtrainingexamples. Discov- V ered concepts include a variety of objects, scenes, actions C and event sub-types, leading to a discriminative and com- . pact representation for event images. Web images are ob- s c tained for each discovered event concept and we use (pre- [ trained)CNNfeaturestotrainconceptclassifiers.Extensive Figure 1: Event concepts as an intermediate feature repre- 1 experimentsonchallengingeventdatasetsdemonstratethat sentationforrecognizingsocialeventsinphotographs. v our proposed method outperforms several baselines using 9 deepCNNfeaturesdirectlyinclassifyingimagesintoevents 6 with limited training examples. We also demonstrate that stantlyevolvingspaceofeventsisnotrealistic,wepropose 7 ourmethodachievesthebestoverallaccuracyonadataset tolearnaneventconcept-basedrepresentationandleverage 4 0 withunseeneventcategoriesusingasingletrainingexam- that to identify rare events. Discovering web-driven con- . ple. cepts using Wikipedia and Flickr tags, we aim to catego- 1 0 rizesocialeventsfromstaticphotographswhenfewlabeled 7 examples are available. Images of social events inherently 1 1.Introduction consist of a combination of objects (e.g. ‘banner’), scenes : v (e.g.‘ground’),actions(e.g.‘shoutingslogans’),eventsub- Thewidespreadadoptionofsmart-phonescoupledwith i types (e.g. ‘speech’) and attributes (e.g. ‘protest peace- X easy to use photo sharing services has resulted in a mas- fully’).Objectappearancesignificantlychangeswhencom- r sive increase in images shared online. A large number a of these images are snapshots of special occasions, which bined with different objects, actions and attributes in clut- tered backgrounds. Hence recognizing events from static werefertoassocialeventssuchasbirthdays, graduations, images requires us to explicitly learn concept classifiers, weddings etc. or news events such as political campaigns, eachconceptacombinationofobjects,scenes,actionsand natural disasters or marathons. Many of these images do attributes. Wecalltheseeventconcepts(seeFigure1). nothavecleantextuallabelshenceidentifyingeventsfrom Event recognition approaches that use concepts or at- visual content alone is non-trivial. The recent success of tributeshavepreviouslybeenappliedtovideo-basedevents deepConvolutionalNeuralNetworks(CNNs)inobjectand [31,32,42,54]wheretemporaldynamicsplayanimportant scene recognition has resulted due to large labeled train- role in recognizing what is happening in the video. This ingdatabasessuchasImageNet[39]andPlaces[56]. Cur- makeseventrecognitionfromasinglephotographaninter- rent approaches which use pretrained CNNs and fine-tune esting and challenging problem domain. Several attribute- on datasets also require significant number of labeled ex- based recognition methods require datasets annotated with amples. Sincecreatinghugelabeleddatasetsfromthecon- alltheconcepts[20,23]whichisatediousprocess. Toby- 1TheauthorcurrentlyworksatGoogle. pass the need for manual concept labeling and inspired by conceptsfromimagesandtheirlabels[55],fromimagede- therecent‘weblysupervised’learningapproaches[5,6,10] scriptions [44] or by using a deep network [5] using prin- that use web content to discover visual concepts, we pro- ciplesfromcurriculumlearning. Ourproposedworkisin- poseaneventconceptlearningframeworkusingWikipedia spired by web supervision but for a different domain. The to generate event categories and Flickr tags as our initial key difference between our approach and other webly su- poolofconcepts. FromnoisyFlickrtags,wegenerateseg- pervised concept learning approaches is that our methods mentsorphrasesusingatweetsegmentationalgorithmpro- are designed to obtain event specific concepts. We explain posedbyLietal.[25]whichisamethoddesignedspecif- furtherinSection3. ically to extract event-centric phrases from noisy twitter The earliest work addressing event classification from streams. Finally, we project each event category on to a static images [28] classifies sports events (rowing, rock word embedding pretrained on the Google News Dataset climbing etc.) using object and scene information. Some using the popular word2vec [34] approach, extract nearest related approaches [2,16] require scene geometry or tem- neighbors and add them to the pool of segmented phrases. poralalignmentbetweeneventimagestoidentifyindividual We extract images related to each concept from MS Bing events. Recently [52] propose to train two deep networks; imagesearchengineandcomputedeepCNN[24]features one on images and the second one on spatial maps of de- extracted from a pretrained network on all the images and tected people/objects at different scales for event recogni- trainconceptclassifiers. Theconceptscorespredictedona tion. Our work aims to learn relevant concepts from the giventestimageformthefinalfeaturesforeventimages. web and uses pretrained CNNs for feature extraction thus Ourprimarycontributionsare: savingtrainingtime. Moreimportantly,trainingadeepnet- work for identifying events in images requires a large la- 1. A novel framework which involves using web data beleddataset. Weattempttoeliminatethatrequirementby to discover event related concepts and employing ef- discoveringandusinggeneraleventconceptsfromtheweb. ficient concept pruning strategies that result in clean, Motivated by the need to learn with few labeled exam- relevantanddiverseeventconcepts. ples, vision researchers have addressed one-shot learning 2. Aconcept-basedrepresentationthatnotonlyimproves to learn object classifiers [1,13,21,22,35] and more re- single-shot event classification performance but can cently used deep networks [15,18] and part-based mod- alsobegeneralizedtothosecategorieswhichwerenot els [49]. Video event recognition community suffers from usedduringconceptdiscovery. lack of labeled data for training too. Hence for recogniz- 3. A large scale Social Event Image Dataset (SocEID) ing events from few examples, many approaches use con- comprising37,000generaleventimagesbelongingto cepts as intermediate representations for event recognition 8eventcategoriesaswellasachallengingRareEvents [31,50]. Ma et al. [32] use labels in external videos as Dataset(RED)comprising7,000imagesbelongingto conceptsandjointlymodelconceptclassificationandevent 21specificrealworldevents. detection. Chen et al. [4] use Flickr tags to discover con- cepts for an event and its associated text description. Cui 2.RelatedWork et al. [9] propose Concept Bank which consists of events mined from WikiHow and concepts from Flickr tags. Ye Attributeshavebeenusedtodescribeobjects(bothfixed etal.[54]extendthepreviousworkandproposetoarrange [3,12,23,48] and relative [36]), faces [20], scenes [37] eventsandtheirconceptsinahierarchylearnedfromWik- andactions[14,30]. Theseattributedetectorsarethenrun iHow articles and YouTube descriptions. Shao et al. [41] on new images for high level recognition [46,47]. Re- generate video event attributes using crowd-based annota- searchershaveexploredcreatingaset(orbank)ofdetectors tion and use motion channels and appearance to train a pretrained on objects such as Object Banks [27], an ontol- deep model. Yang et al. [53] learn video concepts from ogy of abstract concepts such as Classemes [45] or scene YouTubedescriptionsandFlickrtags. Theygenerateevent attributes[8,37]. concepts using a tweet segmentation algorithm along with Another line of work proposes learning visual concepts othermetricsandtrainmultipleclassifiersforasinglecon- fromtheWebwithminimalhumansupervision(‘weblysu- cept. The major difference between their work and ours pervisedapproaches’). NEIL[6]usesimagesearchengine is that we begin from event labels instead of descriptions, results in a semi-supervised setting to learn and train vi- targetimage-basedeventrecognitionwhenfewlabeledex- sual concept detectors. LEVAN [10] uses Google NGram amplesareavailableandintegrateword2vecbasedconcepts corpustoextractallpossiblewordsrelatedtoagivencon- intoourconceptpool. cept, extracts images from image search engine and learn visualconceptsrelatedtoanygivenkeyword. Theauthors of [29] use a multiple instance learning approach to learn conceptsfromimagesearchresults.Someapproacheslearn airshows autoshow beautypageants formofcaptionsandthisformsthenoisycaptionpoolfrom allstargames balletshow beerfestivals whichweaimtogeneratemeaningfulconceptsthatcande- americanfootballmatch ballroomdance birdwatching scribe general social events. We empirically observe that annualprotests balls blackfriday by increasing the number of images from which we mine artexhibition barbecue boating Flickr tags, the noise in the data increases hence we limit artsfestivals baseball bowling ourselvesto200imagespereventcategory. astronomyevents basketballmatch boxing Table1: SampleeventsminedfromWikipedia Tag Segmentation: For the set of events E = {e ,e ,...e }where,forourworkn = 150,wehaveaset 1 2 n of tags T = {t ,t ,...t }, N = 200n in our case. Our 3.Approach 1 2 N goal is to generate consecutive and non-overlapping seg- We begin by asking the question, “What if all events ments S = {s1,s2,...sm}. These segments can be a sin- had limited labeled examples?” In reality, there are sev- gle word or phrases. We obtain a set of segments Si = eral events for which labeled image datasets are available {s1,s2,s3,...smi}⊂Sforeachtagti ∈T,i∈{1,...,N}, e.g.birthdaysandweddings. Wedevelopmethodsandtest by applying a tweet segmentation method [26] which can our approach on datasets containing these popular events bemodeledasanoptimization, bytakingonlyasinglelabeledexamplefromeachcategory as training data. The rest of the dataset is used for test- argmax Stk(t )=(cid:88)mi Stk(s ), (1) i j ing. This approach enables us to determine whether our s1,s2,s3,...smi j=1 proposedmethodsworkonpopulareventsbeforetakingon the harder task of identifying rare events (using our RED where Stk(·) is a function that computes the stickiness of Dataset)fromimages. a segment. Stickiness measures the probability that a seg- Our proposed method is an adaptation of the webly su- ment of text is a ‘named entity’ (name of a person, place pervisedlearningapproachestolearnvisualconceptsrele- or object) in a large text corpus or a knowledge base like vanttospecificeventcategories. Extractingallvisualcon- Wikipedia. Thefinalscoreforeachsegments isgivenby j ceptsrelatedtoakeywordsuchas‘birthday’fromtheweb using the approach of [10] results in many concepts that score(sj)=Stk(sj)·Vflickr(sj). (2) have either little to do with a birthday event or is not gen- In Equation 2 Stk(s ) refers to the stickiness score of eralizable to real world images. Those concepts include j the jth segment and V refers to visual representative- birthday settings advertised by event planners online and flickr ness which is a measure of how visually coherent a word objects associated with birthday with a clean background orphraseis. Wordslike‘economy,’and‘public,’ifqueried andacanonicalviewpoint[33]. Henceweargueforevent- to an image search engine, result in ambiguous images, specific concepts for complex event recognition from im- whereasaphraselike‘birthdaycake’generallyreturnsvery ageswithfewlabeledexamples. similar images. Hence it has a high visual representa- Our approach is divided into three main parts: Event tiveness score as compared to words like ‘economy’ or Concept Discovery, Training Concept Classifiers and Pre- ‘activity.’ We obtain visual representativeness scores for dictionofConceptScoresforEventClassification. each segment via a public dataset made available by Sun 3.1.EventConceptDiscovery andBhowmick[43]wheretheyproviderepresentativeness scoresforthemostpopulartagsonFlickr. Aftercomputing WeuseWikipediatominealistofeventsfromitscate- thefinalscores,weinspectthehighestscoringsegmentsto gory‘SocialEvents.’Thislistcontainsgeneralevents(such remove ambiguous or slang words. Figure 2 shows some as birthday) and specific events (such as royal wedding). sample tags and the returned high scoring segments. Fur- We do not include specific events in our initial events list therdetailsontagsegmentationcanbefoundin[26]. as the aim is to build a concept bank that is applicable for Generatingsegmentsfromtagsisnotenoughtocoverall all events. If we build a concept bank for royal wedding, aspects of a complex event. To expand the taxonomy and concepts such as ‘Kate Middleton,’ ‘Buckingham Palace’ find event-specific concepts we additionally project each etc.wouldbetoospecifictoapplytogenericeventimages. Thusweendupwith150genericsocialevents(seeTable1). ThislistofeventswasminedfromWikipediabutfiltered(to breakdancing,salsadancing,argentineantango,wowscrowd,freerunning,reggaehiphop, dance discofever,flashmobs,streetdance,dancers,breakbeat,hop,dancecraze,dancefest, removespecificevents)manually. bollywoodbhangra,asianpop,danceworkout,hiphopdancetroupe,breakdancers EacheventcategoryinourlistisusedtoqueryFlickrand Table2: Nearestneighborsoftheevent‘dance’in weobtainthefirst200imageresults. Wecollectthetags(a word2vecspace set of words describing the image) of those images in the Figure2: GeneratedsegmentsfromFlickrtagsforeventlabel‘protest.’ images returned for each concept. Clipart, duplicate im- agesandimagescontainingonlytextareremovedfromthe search results. Several concepts in C are correlated with each other, resulting in similar images for different con- cepts. Trainingthemindependentlywithouttakingintoac- counttheircorrelationwillleadtofalsenegativesthatwill impacttheconceptclassifiertrainingnegatively.Thereason why we do not exclude correlated concepts in our concept pool C is that they capture different (and thus important) aspects of a social event. For example, ‘birthday party,’ ‘birthday boy’ and ‘birthday celebrations’ are three differ- entconceptswithinthesameeventcategory(birthday)and Figure3: Eventconceptdiscoverypipelineforgenericso- theymayhavesimilarimages.Thuswhenselectingtraining cialevents. imagesfortrainingtheclassifierforconceptci,itisnaiveto samplenegativetrainingimagesfromallconceptsc where j j (cid:54)= i. Figure 4 shows an example of correlated concepts event label to word2vec space [34]. We use an embed- andtheirassociatedimages. ding trained on the Google News Dataset which consists Hence, we first cluster all the concepts using their ofabout100billionwords. Themodelisavailableforpub- word2vec-based vector representations using minibatch k- lic use (https://code.google.com/p/word2vec) and contains means clustering [40]. We set k = 150. Thus for ith 300-dimensional vectors for 3 million words and phrases. concept c , belonging to a y-sized cluster, we obtain a list i TheadvantageofusingGoogleNewspretrainedvectorsis of concepts Ci = (cid:8)c ,c ,...c (cid:9) that are pos i,pos1 i,pos2 i,posy thatweobtainsemanticallysimilarconceptsforeachevent highly correlated with the given concept c . We construct i label.Table2showsthenearestneighborsoftheeventlabel theconceptclassifiertrainingsetforconceptc asfollows: i ‘dance’inword2vecspace. • Let ξ+ = (cid:8)I (cid:9)u be the set of positive training im- Weextracttop20nearestneighborsforeacheventlabel i i=1 agesforclassificationwhereuisthenumberofimages andaddthemtothetagsegmentsgeneratedviathesegmen- retrievedforconceptc . tationschemedescribedabove. Thispoolofeventconcepts i is then filtered to remove duplicate concepts, slang words • Letξ− = (cid:8)I (cid:9)v bethesetofnegativetrainingim- j j=1 and foreign words. We finally end up with 856 event con- cepts.Ourconceptsnotonlyincludeobjects,scenesandac- tionsbutalsoincludesub-eventsandtheirtypes. Ourevent conceptdiscoverypipelineisshowninFigure3. 3.2.TrainingConceptClassifiers Now we describe our approach for selecting training images for concept classifiers. Given a set of concepts (cid:8) (cid:9) C = c ,c ,...c , we input each concept as an image 1 2 m search query to Microsoft Bing and retrieve the top 100 Figure4: Examplesofcorrelatedeventconcepts Figure5: Selectingtrainingimagesfor‘boxing’classifier ages for classification where v is the number of im- agesretrievedforthesetofconceptsCi = Ci = neg pos C−Ci . pos In other words, we make sure that the set ξ− does not in- clude images retrieved for any concept in the set Ci be- pos cause those images are highly similar to the images in ξ+ astheybelongtothesameclusterasc (SeeFigure5). i Foreachconcept,weextracttheCNN‘fc7’layeractiva- Figure 6: Sample images of the SocEID Dataset for two tionsasfeaturesfromallitsimages,selectthetrainingand events: birthday(top)andgraduation(bottom). test examples as described above and input them to logis- ticregressionclassifiers. Weselecttheclassifierparameters through 5-fold cross validation and for all of our concept 1. Social Event Image Dataset (SocEID): This is the classifiers,thecrossvalidationaccuracyisabove90%. dataset we created in-house. We collected images of the following social events: birthdays, graduations, weddings, 3.3.PredictingConceptScoresforClassification marathons/races,protests,parades,soccermatchesandcon- After training all concept classifiers, we compute the certs. We queried Instagram and Flickr with a tag related classifier scores on images belonging to our evaluation to the event itself (‘wedding day,’ ‘Graduation 2014’ etc.) datasets. ForeachimageI, itsfeaturevectorisaconcate- and downloaded public images in chronological order de- nation of all concept classifier scores predicted on the im- termined by post date. Our dataset includes some rele- (cid:8) (cid:9)m age. Thus f = x where m is the total number of vantimagesfromtheNUS-WIDEdataset[7]andtheSocial I i i=1 conceptsandx isthescorepredictedforithconceptclas- Event Classification subtask from MediaEval 2013 [38]. i sifier.Finally,weusethesefeaturestoclassifyeventimages Wepassedalltheimagesto3trustedcoders(non-Turkers) intosocialeventsusingalinearSVMwithdefaultparame- andaskedthemtofilteranyimagethatdidnotdepictapar- tersfixedforallexperiments. Weoutlineourexperimental ticular social event. We established the final scores on the setupindetailinthenextsection. imagesviamajorityvoteanddiscardedalltherest. Finally, we ended up with nearly 37,000 images. Figure 6 shows 4.ExperimentsandEvaluations sampleimagesfromtheSocEIDdataset. Weevaluateourapproachonfourlabeledeventdatasets 2.WebImageDatasetforEventRecognition(WIDER): withone-shotlearning,thatis,weuseasinglepositiveand This dataset is introduced by [52] and consists of 50,574 negative training example from each class. We also re- imagesannotatedwith61classes. portclassificationresultsforall-shotlearning(usinga70-30 split)forcomparison. 3. UIUC Sports Event Dataset: This dataset [28] con- 4.1.Datasets sistsof1579imagesbelongingto8sportseventscategories suchasbadminton,bocce,sailingetc. We evaluate our concept-based event recognition algo- rithmonthefollowingfourdatasets: Figure7: SampleimagesoftheRareEventsDataset. 4.RareEventsDataset(RED): Thisisanotherin-house which is a publicly available CNN model pretrained on datasetwecollectedbyqueryingMSBingimagesearchen- 978objectcategoriesfromImageNetdatabase[39]and205 ginewithasetof26‘rare’eventcategories. Wecallthem scenecategoriesfromPlacesdataset[56]usingtheAlexNet rare not on the basis of how frequently they occur in the deep architechture [19]. For each concept, we select the world but on how seldom they are found in large labeled positiveandnegativetrainingfeaturesasdescribedinSec- event image datasets. These event categories comprise of tion3.2 andtrainL2-regularizedlogisticregression classi- recentnewseventssuchas:JustinTrudeauelected,election fiersusingthepubliclyavailableLIBLINEARlibrary[11]. campaign Trump and natural disasters such as Hurricane Every image in our event datasets is input to each of the Katrina, Hurricane Sandy and Nepal earthquake (see Fig- trainedconceptclassifiersandaprobabilisticconceptscore ure 7). The whole dataset comprises nearly 7,000 images is computed on it. The fusing of concept scores form the and we do not remove any image from any event category finalfeaturevectorofthatimage. manually. Notethattheseareallspecificevents(thefulllist can be found in the supplementary section) and our main 4.2.1 TrainingwithaSinglePositiveImage motivationbehindcollectingthisdatasetistwofold:i)Since few labeled examples are available for these events, it is a Weconductourone-shotlearningexperimentontheevent suitabletestcaseforourclaimthatourlearnedconceptsare datasets as follows: For event category E with P positive a powerful intermediate representation to recognize events training features and N negative training features we ran- withfewexamples,ii)Wewanttotestwhetherourdiscov- domlysampleapositivetrainingfeaturefpE andanegative eredeventconceptsgeneralizetorecognizingspecificevent training feature fE. We concatenate the two and feed this n imagesornot. into a binary linear SVM as training features. For testing, we simply take the rest of the positive features (P −fE) p rowingchampionships,rowingregatta,recreationalboating,canoetrip,juniorrowing, and sample an equal number of negative features from the rowing canoepolo,standuppandling,swimmingcanoeing,recreationalfishing,cablewakeboard horsebackriding,horseshow,mountainbiking,broncriding,barebackbronc, restoftheeventcategoriesinthedataset. Henceforallour polo horsebackride,ranchrodeo,stampederodeo,canoeinghorseback,ridingmountain experiments, the random baseline is 50%. We run all ex- perimentsfivetimesandaveragetheper-classclassification Table3: Top10predictedconceptsforsportsevents accuracies. ‘rowing’and‘polo’ 4.2.2 Trainingwitha70%-30%split 4.2.ExperimentalSetup In this experiment we take all of the labeled data into ac- count and for each class, randomly select 70% of images Webeginourexperimentsbytrainingeventconceptclas- fortrainingandtestontheremainingimages. Thisexperi- sifiers.Wehaveatotalof86,000imagesassociatedwiththe mentshowsthemaximumaccuracyourmethodcanachieve concepts,retrievedfromMicrosoftBingusingthepublicly given all of the training data available. It provides a nice availableBingcrawlerbyDengxinDai.1 WeusetheCaffe comparisonagainstthecasewhereonlyasinglelabeledim- [17]deeplearningframeworktoextractCNNlayer7activa- ageisavailableforeachclass. Wecompareourresultswith tions(‘fc7’)asfeaturesforalltheimagesusingHybridCNN severalpowerfulbaselines: 1http://www.vision.ee.ethz.ch/˜daid • AlexNet [19] pretrained on ImageNet [39] and Figure8: One-shotlearningresultsonUIUCSportsDatasetandSocEID. Features 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 AlexNet-fc7 57.51 55.76 50.94 53.26 50.45 65.23 52.70 51.56 53.66 55.50 53.01 52.38 54.53 57.69 58.86 WEBLY-fc7 57.97 55.54 53.05 53.07 57.43 64.59 54.32 54.26 55.46 60.19 56.03 60.07 55.13 56.31 60.70 Eventconcepts 60.46 56.62 63.64 52.16 59.63 64.96 51.71 51.59 54.80 56.87 57.82 59.29 49.00 57.05 59.47 Table4: Resultofone-shotlearningonWIDERDataset Places [56] databases, from which we extract 4096- badminton, bocce, croquet, polo, rock climbing, rowing, dimensional layer fc7’s activations and use them as sailing and snowboarding. The main reason why this oc- features. We refer to this baseline as AlexNet-fc7 in curs is that our initial event list and hence our discovered theresults. event concepts include several sports events and their sub- types.Forexample,foralltheimageslabeledwiththeevent • Chen et al. [5] which is a recently proposed webly “rowing”and“polo”intheUIUCSportsDataset,wecount supervised CNN trained on about 2.1 million images thetop10mostfrequentlypredictedconcepts.Table3qual- downloaded from Google Images using popular vi- itatively shows that our method extracts relevant concepts sion datasets’ labels as search queries. The authors consistentlyaccrossthesetofUIUCSportseventsimages. use 2,240 objects, 89 attributes, and 874 scene labels Ourone-shotlearningexperimentontheSocEIDdataset from ImageNet [39], SUN database [51] and NEIL (Figure8right)resultsineventconceptsoutperformingthe knowledgebase[6]anduseprinciplesfromcurriculum baselines in 4 out of 8 categories. From 1-8, the cate- learning to train the network with easy examples first gories are: birthday, concert, graduation, marathons, pa- andthenhardexamplesfromFlickr. Theyshowstate rade, protest, soccer and wedding. This is very likely due of the art performance compared to AlexNet for ob- to the nature of images found in our dataset. Our dataset jection detection and scene recognition. We use their contains very clean images from the Web with significant ‘GoogleA’ network which is trained on 2.1 million visual cues of popular events such as birthdays. Thus the Googleimages. WerefertothisbaselineasWEBLY- weblysupervisednetworkof[5]andWIDER[52]finetuned fc7. oneventimagesisabletodiscriminatebetweenthediffer- enteventclassesbasedoncuessuchasgraduationcapsand • AlexNet[19]finetunedonWIDERdatabase[52].This bridalgownsinthegraduationandweddingpicturesrespec- baselinemodelisprovidedbytheauthors. Wewantto tivelyeveniflimitedtrainingexamplesarepresent. seehowfc7featuresextractedfromthismodelperform Our experiments on the WIDER dataset [52] yield in- with limited training data on our evaluation datasets teresting insights. There are 61 classes in the dataset (except WIDER) and whether our concept-level fea- (the full list can be found in our supplementary material). turesarecomparabletoit. Werefertothisbaselineas For brevity, we only show the first 15 classes (we fol- WIDER-fc7. low the order given by the authors). We test one-shot 5.ResultsandDiscussion learning using our event concept features and compare them against the baselines, WEBLY-fc7 and AlexNet-fc7. Our one-shot learning result on UIUC Sports Events Our event concept features outperform AlexNet-fc7 and dataset (Figure 8 left) shows that the event concept fea- WEBLY-fc7 in 31 classes when training with a single im- tures significantly outperform all the baselines in 6 out of age. Table 4 shows our results. From 1-15, the classes 8events. From1-8, theUIUCSportseventcategoriesare: Eventclasses 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 Avg.Acc AlexNet-fc7 54.0 53.4 60.0 56.8 53.3 54.0 53.0 50.3 59.9 56.0 52.5 57.5 65.3 54.0 53.6 53.2 55.5 55.5 64.0 50.7 62.6 56.0 WEBLY-fc7 52.2 58.1 60.9 56.2 53.9 56.3 50.4 55.5 67.8 54.9 61.0 61.3 59.6 55.6 47.8 61.0 53.2 52.1 56.0 52.6 69.0 57.0 WIDER-fc7 50.4 55.5 57.2 52.2 55.8 55.1 51.1 52.8 55.5 52.6 52.6 58.9 63.3 53.7 47.3 49.8 53.4 55.6 61.3 52.8 59.3 54.6 Eventconcepts 57.5 58.9 67.7 55.5 54.8 53.1 53.9 58.6 75.1 54.2 60.8 55.4 73.4 56.2 52.0 58.0 56.2 52.4 58.7 53.3 64.8 58.6 Table5: Resultofone-shotlearningonREDDataset OverallAverageAccuracy(%) OverallAverageAccuracy(%) Features UIUCSports SocEID WIDER RED Features UIUCSports SocEID WIDER RED AlexNet-fc7 59.79 52.57 55.40 55.94 AlexNet-fc7 96.47 86.42 77.94 77.86 WEBLY-fc7 55.39 60.89 58.14 56.92 WEBLY-fc7 95.16 83.66 77.85 79.39 WIDER-fc7 70.81 58.11 N/A 54.58 WIDER-fc7 93.85 80.42 N/A 76.64 Eventconcepts 82.09 62.98 58.29 58.59 Eventconcepts 96.68 85.39 78.59 77.57 Table6: Overallaccuracyforone-shotlearningonthe Table7: Overallaccuracyforall-shotlearningonthe evaluationdatasets evaluationdatasets are: parade, handshaking, demonstration, riot, dancing, We also evaluate our method against the baselines us- car-accident, funeral, cheering, election-campaign, press- ingalltheavailabletrainingdata(70%-30%split). Table7 conference, people-marching, meeting, group, interview showstheoverallclassificationaccuraciesonourevaluation and traffic. We note that the WIDER dataset contains not datasets. Evenwhenusingalltrainingexamples,ourevent only events but also individual actions such as ‘cheering’. concept features are comparable to the state of the art in Our proposed approach uses concepts such as ‘cheering’ recognizing events from images. For two datasets, (UIUC toidentifyeventswhichtypicallyinvolvecheeringsuchas Sports and WIDER) our method actually outperforms the dances,gamesorgraduations(asthecheeringconceptwill stateoftheart. resultinhighprobabilisticpredictiononsuchevents).How- ever, the model is not trained to identify cheering alone. 6.Conclusion This can be verified by noting our performance scores in Inthispaperweproposetodiscoverevent-specificcon- classes such as parade (1), demonstration (3), dancing (5) cepts from the web to recognize complex events from im- etc. Wescorewellonothercategories(notshownintable) ageswithfewlabeledexamples. Ourproposedframework suchasbasketball,soccer,running,aerobicsetc.Categories discoversrelevantconceptsbycombiningsegmentedFlickr where our scores are comparable but not above the base- tags and word2vec nearest neighbors of event categories lines are: sports coach trainer, greeting, surgeons, spa and resulting in a compact intermediate representation which stock market, to name a few. These categories are recog- identifies real world events with only a single training ex- nized more effectively by deep CNNs pretrained to recog- ample. We show the strength of our proposed method by nizeobjectsandscenes. evaluating on challenging datasets against powerful base- Finally,weevaluateourmethodontheREDdataset(see lineswhichdirectlyuseCNNfeaturespretrainedonobjects, Table5)whichisthemostchallengingbecausetheimages scenes,attributesandevents. Itisinterestingtonotethatin arehighlydiverseandconsistofspecificrealworldevents. the problem domain of event recognition from visual con- In one-shot learning on RED, for 10 out of 21 classes, tentwhereonlyafewtrainingexamplesareavailable,web- our proposed method outperfoms the baselines. Thus our driven concept discovery and web images for training can method is able to generalize to unseen event categories as resultinhighlydiscriminativeintermediaterepresentations our initial event list does not contain any of the rare event whichoutperformdirectlyusingdeepCNNstrainedonmil- categoriesintheREDdataset. Fromlefttoright,categories lions of images and even deep CNNs finetuned on a large are: ‘Russian airstrikes Syria,’ ‘Boston Bombing,’ ‘Nepal eventdataset. Earthquake,’ ‘Arab Spring,’ etc. (the full list can be found inthesupplementarymaterial). References The overall classification accuracies are shown in Ta- ble 7. For all the datasets, event image classification us- [1] E. Bart and S. Ullman. Cross-generalization: Learning ingasingletrainingexamplewithourproposedeventcon- novelclassesfromasingleexamplebyfeaturereplacement. ceptsasfeaturesoutperformthebaselineswhichshowsthe In Computer Vision and Pattern Recognition, 2005. CVPR strength of our approach to recognize complex real world 2005. IEEE Computer Society Conference on, volume 1, eventswhenlimitedlabeledexamplesareavailable. pages672–679.IEEE,2005. [2] L.Bossard, M.Guillaumin, andL.Van. Eventrecognition tionalarchitectureforfastfeatureembedding.arXivpreprint inphotocollectionswithastopwatchhmm.InComputerVi- arXiv:1408.5093,2014. sion(ICCV),2013IEEEInternationalConferenceon,pages [18] G. Koch. Siamese Neural Networks for One-Shot Image 1193–1200.IEEE,2013. Recognition. PhDthesis,UniversityofToronto,2015. [3] H.Chen,A.Gallagher,andB.Girod.Describingclothingby [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet semanticattributes. InComputerVision–ECCV2012,pages classification with deep convolutional neural networks. In 609–623.Springer,2012. Advances in neural information processing systems, pages [4] J.Chen,Y.Cui,G.Ye,D.Liu,andS.-F.Chang.Event-driven 1097–1105,2012. semanticconceptdiscoverybyexploitingweaklytaggedin- [20] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. ternet images. In Proceedings of International Conference Attributeandsimileclassifiersforfaceverification. InCom- onMultimediaRetrieval,page1.ACM,2014. puterVision, 2009IEEE12thInternationalConferenceon, [5] X.ChenandA.Gupta. Weblysupervisedlearningofconvo- pages365–372.IEEE,2009. lutionalnetworks. InProceedingsoftheIEEEInternational [21] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenen- ConferenceonComputerVision,pages1431–1439,2015. baum. Oneshotlearningofsimplevisualconcepts. InPro- [6] X.Chen,A.Shrivastava,andA.Gupta. Neil: Extractingvi- ceedingsofthe33rdAnnualConferenceoftheCognitiveSci- sualknowledgefromwebdata. InComputerVision(ICCV), enceSociety,volume172,page2,2011. 2013IEEEInternationalConferenceon,pages1409–1416. [22] B.M.Lake,R.R.Salakhutdinov,andJ.Tenenbaum. One- IEEE,2013. shotlearningbyinvertingacompositionalcausalprocess.In [7] T.-S.Chua,J.Tang,R.Hong,H.Li,Z.Luo,andY.-T.Zheng. Advances in neural information processing systems, pages Nus-wide: Areal-worldwebimagedatabasefromnational 2526–2534,2013. university of singapore. In Proc. of ACM Conf. on Image [23] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning and Video Retrieval (CIVR’09), Santorini, Greece., July 8- to detect unseen object classes by between-class attribute 10,2009. transfer. InComputerVisionandPatternRecognition,2009. [8] G.Ciocca,C.Cusano,S.Santini,andR.Schettini. Halfway CVPR 2009. IEEE Conference on, pages 951–958. IEEE, through the semantic gap: Prosemantic features for image 2009. retrieval. InformationSciences,181(22):4943–4958,2011. [24] B. B. Le Cun, J. S. Denker, D. Henderson, R. E. Howard, [9] Y.Cui, D.Liu, J.Chen, andS.-F.Chang. Buildingalarge W.Hubbard,andL.D.Jackel.Handwrittendigitrecognition conceptbankforrepresentingeventsinvideo.arXivpreprint withaback-propagationnetwork. InAdvancesinneuralin- arXiv:1403.7591,2014. formationprocessingsystems.Citeseer,1990. [10] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning [25] C.Li,A.Sun,andA.Datta. Twevent: segment-basedevent everything about anything: Webly-supervised visual con- detectionfromtweets. InProceedingsofthe21stACMin- cept learning. In Computer Vision and Pattern Recogni- ternationalconferenceonInformationandknowledgeman- tion(CVPR),2014IEEEConferenceon,pages3270–3277. agement,pages155–164.ACM,2012. IEEE,2014. [26] C.Li,J.Weng,Q.He,Y.Yao,A.Datta,A.Sun,andB.-S. [11] R.-E.Fan,K.-W.Chang,C.-J.Hsieh,X.-R.Wang,andC.-J. Lee. Twiner: named entity recognition in targeted twitter Lin. Liblinear: Alibraryforlargelinearclassification. The stream.InProceedingsofthe35thinternationalACMSIGIR JournalofMachineLearningResearch,9:1871–1874,2008. conferenceonResearchanddevelopmentininformationre- [12] A.Farhadi,I.Endres,D.Hoiem,andD.Forsyth.Describing trieval,pages721–730.ACM,2012. objectsbytheirattributes. InComputerVisionandPattern [27] F.Li,J.Carreira,andC.Sminchisescu. Objectrecognition Recognition,2009.CVPR2009.IEEEConferenceon,pages asrankingholisticfigure-groundhypotheses. InComputer 1778–1785.IEEE,2009. VisionandPatternRecognition(CVPR),2010IEEEConfer- [13] L.Fei-Fei,R.Fergus,andP.Perona.One-shotlearningofob- enceon,pages1712–1719.IEEE,2010. jectcategories. PatternAnalysisandMachineIntelligence, [28] L.-J.LiandL.Fei-Fei. What,whereandwho? classifying IEEETransactionson,28(4):594–611,2006. eventsbysceneandobjectrecognition. InComputerVision, [14] Y.Fu,T.M.Hospedales,T.Xiang,andS.Gong. Attribute 2007.ICCV2007.IEEE11thInternationalConferenceon, learning for understanding unstructured social activity. In pages1–8.IEEE,2007. Computer Vision–ECCV 2012, pages 530–543. Springer, [29] Q. Li, J. Wu, and Z. Tu. Harvesting mid-level visual con- 2012. ceptsfromlarge-scaleinternetimages.InProceedingsofthe [15] D.Held,S.Thrun,andS.Savarese.Deeplearningforsingle- IEEEConferenceonComputerVisionandPatternRecogni- viewinstancerecognition.arXivpreprintarXiv:1507.08286, tion,pages851–858,2013. 2015. [30] J.Liu,B.Kuipers,andS.Savarese. Recognizinghumanac- [16] V. Jain, A. Singhal, and J. Luo. Selective hidden random tionsbyattributes.InComputerVisionandPatternRecogni- fields: Exploitingdomain-specificsaliencyforeventclassi- tion(CVPR),2011IEEEConferenceon,pages3337–3344. fication. InComputerVisionandPatternRecognition,2008. IEEE,2011. CVPR2008.IEEEConferenceon,pages1–8.IEEE,2008. [31] J.Liu,Q.Yu,O.Javed,S.Ali,A.Tamrakar,A.Divakaran, [17] Y.Jia,E.Shelhamer,J.Donahue,S.Karayev,J.Long,R.Gir- H. Cheng, and H. Sawhney. Video event recognition us- shick, S. Guadarrama, and T. Darrell. Caffe: Convolu- ing conceptattributes. InApplications ofComputer Vision (WACV), 2013 IEEE Workshop on, pages 339–346. IEEE, [46] J.VogelandB.Schiele.Semanticmodelingofnaturalscenes 2013. forcontent-basedimageretrieval. InternationalJournalof [32] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. ComputerVision,72(2):133–157,2007. Hauptmann. Complex event detection via multi-source [47] G.Wang,D.Hoiem,andD.Forsyth. Learningimagesim- videoattributes. InComputerVisionandPatternRecogni- ilarity from flickr groups using stochastic intersection ker- tion(CVPR),2013IEEEConferenceon,pages2627–2633. nelmachines. InComputerVision,2009IEEE12thInterna- IEEE,2013. tionalConferenceon,pages428–435.IEEE,2009. [33] E.MezumanandY.Weiss. Learningaboutcanonicalviews [48] P.Welinder,S.Branson,T.Mita,C.Wah,F.Schroff,S.Be- frominternetimagecollections. InAdvancesinNeuralIn- longie,andP.Perona. Caltech-ucsdbirds200. 2010. formationProcessingSystems,pages719–727,2012. [49] A. Wong and A. L. Yuille. One shot learning via compo- [34] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient sitionsofmeaningfulpatches. InProceedingsoftheIEEE estimation of word representations in vector space. arXiv InternationalConferenceonComputerVision,pages1197– preprintarXiv:1301.3781,2013. 1205,2015. [35] E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning [50] S.Wu,S.Bondugula,F.Luisier,X.Zhuang,andP.Natara- from one example through shared densities on transforms. jan. Zero-shoteventdetectionusingmulti-modalfusionof InComputerVisionandPatternRecognition,2000.Proceed- weakly supervised concepts. In Computer Vision and Pat- ings.IEEEConferenceon,volume1,pages464–471.IEEE, ternRecognition(CVPR),2014IEEEConferenceon,pages 2000. 2665–2672.IEEE,2014. [36] D. Parikh and K. Grauman. Relative attributes. In Com- [51] J.Xiao, J.Hays, K.A.Ehinger, A.Oliva, andA.Torralba. puter Vision (ICCV), 2011 IEEE International Conference Sundatabase: Large-scalescenerecognitionfromabbeyto on,pages503–510.IEEE,2011. zoo. In Computer vision and pattern recognition (CVPR), [37] G.PattersonandJ.Hays. Sunattributedatabase: Discover- 2010IEEEconferenceon,pages3485–3492.IEEE,2010. ing, annotating, and recognizing scene attributes. In Com- [52] Y.Xiong,K.Zhu,D.Lin,andX.Tang. Recognizecomplex puter Vision and Pattern Recognition (CVPR), 2012 IEEE eventsfromstaticimagesbyfusingdeepchannels. InPro- Conferenceon,pages2751–2758.IEEE,2012. ceedings of the IEEE Conference on Computer Vision and [38] T.Reuter,S.Papadopoulos,G.Petkos,V.Mezaris,Y.Kom- PatternRecognition,pages1600–1609,2015. patsiaris,P.Cimiano,C.deVries,andS.Geva. Socialevent [53] X. Yang, T. Zhang, C. Xu, and M. S. Hossain. Automatic detectionatmediaeval2013:Challenges,datasets,andeval- visualconceptlearningforsocialeventunderstanding. Mul- uation. In Proceedings of the MediaEval 2013 Multime- timedia,IEEETransactionson,17(3):346–358,2015. diaBenchmarkWorkshopBarcelona,Spain,October18-19, [54] G. Ye, Y. Li, H. Xu, D. Liu, and S.-F. Chang. Event- 2013,2013. net: A large scale structured concept library for complex [39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, eventdetectioninvideo. InProceedingsofthe23rdAnnual S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, ACM Conference on Multimedia Conference, pages 471– A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual 480.ACM,2015. RecognitionChallenge. InternationalJournalofComputer [55] B.Zhou, V.Jagadeesh, andR.Piramuthu. Conceptlearner: Vision(IJCV),pages1–42,April2015. Discoveringvisualconceptsfromweaklylabeledimagecol- [40] D.Sculley.Web-scalek-meansclustering.InProceedingsof lections. In Proceedings of the IEEE Conference on Com- the19thinternationalconferenceonWorldwideweb,pages puter Vision and Pattern Recognition, pages 1492–1500, 1177–1178.ACM,2010. 2015. [41] J.Shao,K.Kang,C.C.Loy,andX.Wang. Deeplylearned [56] B.Zhou, A.Lapedriza, J.Xiao, A.Torralba, andA.Oliva. attributesforcrowdedsceneunderstanding. InProc.CVPR, Learning deep features for scene recognition using places 2015. database.InAdvancesinNeuralInformationProcessingSys- [42] B. Singh, X. Han, Z. Wu, V. I. Morariu, and L. S. Davis. tems,pages487–495,2014. Selectingrelevantwebtrainedconceptsforautomatedevent retrieval. InProceedingsoftheIEEEInternationalConfer- enceonComputerVision,pages4561–4569,2015. [43] A.SunandS.S.Bhowmick.Quantifyingtagrepresentative- ness of visual content of social images. In Proceedings of theinternationalconferenceonMultimedia,pages471–480. ACM,2010. [44] C.Sun,C.Gan,andR.Nevatia. Automaticconceptdiscov- ery from parallel text and visual corpora. In Proceedings of the IEEE International Conference on Computer Vision, pages2596–2604,2015. [45] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In Computer Vision–ECCV2010,pages776–789.Springer,2010.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.