VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW 1 5 Object localization in ImageNet 1 0 2by looking out of the window g u AlexanderVezhnevets UniversityofEdinburgh A [email protected] Edinburgh,Scotland,UK 4 VittorioFerrari [email protected] ] V C Abstract . s c We propose a method for annotating the location of objects in ImageNet. Tradi- [ tionally,thisiscastasanimagewindowclassificationproblem,whereeachwindowis consideredindependentlyandscoredbasedonitsappearancealone.Instead,wepropose 2 a method which scores each candidate window in the context of all other windows in v theimage,takingintoaccounttheirsimilarityinappearancespaceaswellastheirspa- 1 8 tialrelationsintheimageplane. Wedeviseafastandexactproceduretooptimizeour 1 scoring function over all candidate windows in an image, and we learn its parameters 1 using structured output regression. We demonstrate on 92000 images from ImageNet 0 thatthissignificantlyimproveslocalizationoverrecenttechniquesthatscorewindowsin . isolation[32,35]. 1 0 51 Introduction 1 : vThe ImageNet database [9] contains over 14 million images annotated by the class label i Xof the main object they contain. However, only a fraction of them have bounding-box an- notations (10%). Automatically annotating object locations in ImageNet is a challenging r aproblem,whichhasrecentlydrawnattention[15,16,35]. Theseannotationscouldbeused astrainingdataforproblemssuchasobjectclassdetection[8],tracking[21]andposeesti- mation [4]. Traditionally, object localization is cast as an image window scoring problem, whereascoringfunctionistrainedonimageswithbounding-boxesandappliedtooneswith- out. The image is first decomposed into candidate windows, typically by object proposal generation[3,12,23,32]. Eachwindowisthenscoredbyaclassifiertrainedtodiscriminate instancesoftheclassfromotherwindows[8, 14, 15, 17, 32, 34, 36]oraregressortrained to predict their overlap with the object [6, 20, 33, 35]. Highly scored windows are finally deemedtocontaintheobject. Inthisparadigm,theclassifierlooksatonewindowatatime, makingadecisionbasedonlyonthatwindow’sappearance. We believe there is more information in the collection of windows in an image. By takingintoaccounttheappearanceofallwindowsatthesametimeandconnectingittotheir spatialrelationsintheimageplane,wecouldgobeyondwhatcanbedonebylookingatone window at a time. Consider the baseball in fig. 1(a). For a traditional method to succeed, theappearanceclassifierneedstoscorethewindowonthebaseballhigherthanthewindows containingit. Thecontainerwindowscannothelpexceptbyscoringlowerandbediscarded. By considering one window at a time with a classifier that only tries to predict whether it (cid:13)c 2015.Thecopyrightofthisdocumentresideswithitsauthors. Itmaybedistributedunchangedfreelyinprintorelectronicforms. 2 VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW (a) (b) Image space Window appearance space Image space Window appearance space Figure 1: Connecting the appearance and window position spaces. (a) a window tight on the baseball(greenstarintheappearancespaceplot)andsomelargerwindowscontainingit(redcircles intheappearancespace). Blackpointsinappearancespacerepresentallothercandidatewindows. (b)allwindowswithhighoverlapwiththewolfareshowninred,bothintheimageandinappearance space. The ground-truth bounding-box to be found is shown in green. The appearance space plots areactualdatapoints,representingwindowsin3-dimensionalAssociativeEmbeddingofSURFbag- of-words.Pleaseseemaintextfordiscussion. coverstheobjecttightly, onecannotdomuchmorethanthat. Thefirstkeyelementofour workistopredictricherspatialrelationsbetweeneachcandidatewindowandtheobjectto be detected, including part and container relations. The second key element is to employ thesepredictionstoreasonaboutrelationsbetweendifferentwindows. Inthisexample,the containerwindowsarepredictedtocontainasmallertargetobjectsomewhereinsidethem, andtherebyactivelyhelpbyreinforcingthescoreofthebaseballwindow.Fig.1(b)illustrates another example of the benefits of analyzing all windows jointly. Several windows which have high overlap with each other and with the wolf form a dense cluster in appearance space, making it hard to discriminate the precise bounding-box by its appearance alone. However,theprecisebounding-boxispositionedatanextremepointofthecluster—onthe tip. By considering the configuration of all the windows in appearance space together we canreinforceitsscore. Inanutshell,weproposetolocalizeobjectsinImageNetbyscoringeachcandidatewin- dowinthecontextofallotherwindowsintheimage,takingintoaccounttheirsimilarityin appearance space as well as their spatial relations in the image plane. To represent spatial relations of windows we propose a descriptor indicative of the part/container relationship of the two windows and of how well aligned they are (sec. 2). We learn a windows ap- pearancesimilaritykernelusingtherecentAssociativeEmbeddingtechnique[35](sec.3). Wedescribeeachwindowwithasetofhyper-featuresconnectingtheappearancesimilarity andspatialrelationsofthatwindowtoallotherwindowsinthesameimage. Thesehyper- featuresareindicativeoftheobject’spresencewhentheappearanceofawindowaloneisnot enough(e.g. fig1). Thesehyper-featuresarethenlinearlycombinedintoanoverallscoring function(sec.4). Wedeviseafastandexactproceduretooptimizeourscoringfunctionover allcandidatewindowsinatestimage(sec.4.1),andwelearnitsparametersusingstructured outputregression[31](sec.4.2). WeevaluateourmethodonasubsetofImageNetcontaining219classeswithmorethan 92000 images [15, 16, 35]. The experiments show that our method outperforms a recent approach for this task [35], an MKL-SVM baseline [34] based on the same features, and thepopularUVAobjectdetector[32]. Theremainderofthepaperisorganizedasfollows. Sec. 2 and 3 introduce the spatial relation descriptors which we use in sec. 4 to define our newlocalizationmodel. Insec.5wereviewrelatedwork. Experimentsandconclusionsare presentedinsec.6. VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW 3 A E B D 7 C (a) (b) Figure2: Spatialrelationsρ(w,w(cid:48))betweentwowindows.(a)Thefirstelementindicatehowmuch wandw(cid:48)overlap,inthetraditionalPASCALVOCsense[13]. Thesecondelementindicateswhether windowwisapartofw(cid:48). Thethirdelementmeasureswhetherwisacontainerofw(cid:48). (b)Computing thespatialrelationsρ(wi,wl)andρ(wj,wl)forhyper-featuresφG(3)andφL(4). 2 Describing spatial relations between windows Candidatewindows. Recently, objectclassdetectorsaremovingawayfromthesliding- window paradigm and operate instead on a relatively small collection of candidate win- dows[3,12,14,15,23,32,35,36](alsocalled‘objectproposals’). Thecandidatewindow generators are designed to respond to objects of any class, and typically just 1000−2000 candidatesaresufficienttocoverallobjectsinaclutteredimage[3,12,23,32]. Givenatest image,theobjectlocalizationtaskisthentoselectacandidatewindowcoveringaninstance of a particular class (e.g. cars). Following this trend, we generate about 1000 candidate windowsW ={w}N usingtherecentmethod[23]. i=1 Spatialrelationdescriptor. Weintroduceherearepresentationofthespatialrelationsbe- tween two windows w and w(cid:48), which we later use in our localization model (sec. 4). We summarizethespatialrelationbetweenwindowswandw(cid:48) usingthefollowingspatialrela- tiondescriptor(fig.2(a)) (cid:18)w∩w(cid:48) w∩w(cid:48) w∩w(cid:48)(cid:19) ρ(w,w(cid:48))= , , (1) w∪w(cid:48) w w(cid:48) wherethe∩operatorindicatestheareaoftheintersectionbetweenthewindows,and∪the areaoftheirunion.Thedescriptorcapturesthreedifferentkindsofspatialrelations.Thefirst isthefamiliarintersection-over-union(overlap),whichisoftenusedtoquantifytheaccuracy ofanobjectdetector[1,13]. Itis1whenw=w(cid:48),anddecaysrapidlywiththemisalignment betweenthetwowindows. Thesecondrelationmeasureshowmuchofwiscontainedinside w(cid:48). It is high when w is a part of w(cid:48), e.g. when w(cid:48) is a car and w is a wheel. The third relationmeasureshowmuchofw(cid:48) iscontainedinsidew. Itishighwhenwcontainsw(cid:48),e.g. w(cid:48)isasnookerballandwisasnookertable. Allthreerelationsare0ifwandw(cid:48)aredisjoint andare1ifwandw(cid:48) matchperfectly. Hencethedescriptorisindicativeforpart/container relationshipsofthetwowindowsandofhowwellalignedtheyare. Vector field of window relations. Relative to a particular candidate window w, we can l computethespatialrelationdescriptortoanywindoww. Thisinducesavectorfieldρ(·,w) l overthecontinuousspaceofallpossiblewindowpositions. Weobservethefieldonlyatthe discretesetofcandidatewindowsW. Akeyelementofourworkistoconnectthisfieldof spatialrelationstomeasurementsofappearancesimilaritybetweenwindows. Thisconnec- tionbetweenpositionandappearancespacesdrivesthenewcomponentsinourlocalization model(sec.4). 4 VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW 3 Predicting spatial relations with the object A particularly interesting case is when w(cid:48) is the true bounding-box of an object w∗. For theimagesinthetrainingset,weknowthespatialrelationsρ(w,w∗)betweenallcandidate windowswandthebounding-boxw∗.Wecanusethemtolearntopredictthespatialrelation between candidate windows and the object from window appearance features x in a test image,whereground-truthbounding-boxesarenotgiven. Following [35], we use Gaussian Processes regression (GP) [27] to learn to predict a probability distribution P(ρr(w,w∗)|x) ∼ GP(m(x),K(x,x(cid:48))) for each spatial relation r ∈ {overlap,part,cont}givenwindowappearancefeaturesx. Weusezeromeanm(x)=0 and learn the kernel (covariance function) K(x,x(cid:48)) as in [35]. This kernel plays the role of anappearancesimilaritymeasurebetweentwowindows. TheGPlearnskernelparameters sothattheresultingappearancesimilaritycorrelateswiththespatialrelationtobepredicted, i.e. sothattwowindowswhichhavehighkernelvaluealsohaveasimilarspatialrelationto theground-truth. WewillusethelearntK(x,x(cid:48))laterinsec.4. Forawindoww inatestimage,theGPpredictsaGaussiandistributionforeachrelation i descriptors. Wedenotethemeansofthesepredictivedistributionsasµ(x)=(µoverlap(x), i i µpart(x),µcont(x)), and their standard deviation as σ(x). The standard deviation is the i i i sameforallrelations,asweusethesamekernelandinducingpoints. 4 Object localization with spatial relations We are given a test image with (a) set of candidate windows W ={w}N ; (b) their ap- i i=1 pearance features X ={x}N ; (c) the mean M ={µ(x)}N and standard deviation Σ= i i=1 i i=1 {σ(x)}N of their spatial relations with the object bounding-box, as predicted by the GP; i i=1 (d)theappearancesimilaritykernelK(x,x )(sec.3). i j Letw ∈W beacandidatewindowtobescored. Weproceedbydefiningasetofhyper- l featuresΦ(X,W,M,l)characterizingw,andthendefineourscoringfunctionthroughthem. l Consistencyofpredicted&inducedspatialrelationsφ C φr(X,W,l)=max|ρr(w,w)−µr(x)| (2) C i l i i Assumeforamomentthatw correctlylocalizesaninstanceoftheobjectclass. Select- l ingw wouldinducespatialrelationsρr(w,w)toallotherwindowsw. Thehyper-feature l i l i φ checkswhethertheseinducedspatialrelationsareconsistentwiththosepredictedbyGP C based on the appearance of the other windows (µr(x)). If so, that is a good sign that w i l isindeedthecorrectlocationoftheobject. Moreprecisely, thehyper-featuremeasuresthe disagreement between the induced ρr(w,w) and predicted µr(x) on the window w with i l i i thelargestdisagreement. Fig.3illustratesitonatoyexample. Themaximumdisagreement, insteadofamoreintuitivemean, islessinfluencedbydisagreementoverbackgroundwin- dows,whichareusuallypredictedbyGPtohavesmall,butnon-zerorelationstotheobject. It focuses better on the alignment of the peaks of the predicted {µr(x)}N and observed i i=1 {ρr(w,w)}N measurements of the vector field ρr(·,w), which is more indicative of w i l i=1 l l beingacorrectlocalization. VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW 5 over.partcont. GP prediction [0.8, 0.8, 0.9] [0.5, 0.3, 0.9] [0.4, 0.3, 0.8] Spatial [1.0, 1.0, 1.0] relations [0.2, 0.2, 1.0] [0.3, 0.3, 1.0] Hyper- [0.3, 0.2, 0.2] feature Figure 3: Hyper-featuresφC forawindowwl arecomputedbyfindingamaximumdisagreement betweenspatialrelationsthatarepredictedbyGPµr(xi)andinducedbyspatialrelationsρ(wi,wl) forallotherwindowsw intheimage. Thefigureillustratesthisonatoyexamplewiththreewindows i wl,w1,w2. For each r∈{overlap,part,cont} the pairs of µr(xi) and ρ(wi,wl) with maximum disagreementarehighlightinred. Globalspatialrelations&appearanceφ G 2 N−1 N φr(X,W,l)= ∑ ∑ |ρr(w,w)−ρr(w ,w)|·K(x,x ). (3) G N2−N i l j l i j i=1 j=i+1 Thishyper-featurereactstopairsofcandidatewindows(w,w )withsimilarappearance i j (high K(x,x )) but different spatial relations to w. Two windows w,w contribute signif- i j l i j icantly to the sum if they look similar (high K(x,x )) and their spatial relations to w are i j l different(high|ρr(w,w)−ρr(w ,w)|). i l j l A large value of φ indicates that the vector field of the spatial relations to w is not G l smoothwithrespecttoappearancesimilarity. Thisindicatesthatw hasaspecialroleinthe l structureofspatialandappearancerelationswithinthatimage.Bymeasuringthispattern,φ G helps the localization algorithm to select a better window, when the information contained inappearancefeaturesofw aloneisnotenough. Forexample,awindoww tightlycovering l l asmallobjectsuchasthebaseballinfig.1(a)hashighφcont, becauseotherwindowscon- G tainingitoftenlooksimilartowindowsnotcontainingit. Inthiscase,ahighvalueofφ isa G positiveindicationforw beingacorrectlocalization.Ontheotherhand,awindoww tightly l l coveringthewolfinfig.1(b)haslowφoverlap,becausewindowsthatoverlapwithitareall G similartoeachotherinappearancespace. Inthiscase,thislowvalueisapositiveindication forw beingcorrect. Inwhichdirectiontousethishyper-featureislefttothelearningofits l weightinthefullscoringfunction,whichisseparateforeachobjectclass(sec4.2). Localspatialrelations&appearanceφ L 1 N φr(X,W,l)= ∑|1−ρr(w,w)|·K(x,x). (4) L N i l i l i=1 Thishyper-featureisanaloguetoφ ,butfocusesaroundw inappearancespace.Itisindica- G l tiveofwhetherwindowsthatlooksimilartow (highK(x,x))arealsosimilarinposition l i l intheimage,i.e. theirspatialrelationρr(w,w)tow iscloseto1. i l l Windowclassifierscoreφ . Thelasthyper-featureisthescoreofaclassifierwhichpre- S dicts whether a window w covers an instance of the class, based only on its appearance l features x. Standard approaches to object localization typically consider only this cue [8, l 6 VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW 14,15,17,32,34,36]. Inprinciple,wecoulduseanysuchmethodasthescoreφ here. In S practice,wereusetheGPpredictionoftheoverlapofawindowwiththeobjectbounding-box asφ .Onepossibilitywouldbetosimplyusethemeanpredictedoverlapµoverlap(x).How- s l ever, asshownin [35], itisbeneficialtotakeintoaccounttheuncertaintyofthepredicted overlap,whichisalsoprovidedbytheGPasthestandarddeviationσ(x)oftheestimate l φ (X,l)=[µoverlap(x), σ(x)] (5) S l l Usingthishyper-featuresalonewouldcorrespondtothemethodof[35]. Completescore. Let Φ(X,W,l)=[{φr(X,W,l)} ,{φr(X,W,l)} ,{φr(X,W,l)} ,φ (X,l)] G r L r C r S beaconcatenationofallhyper-featuresdefinedaboveforaparticularcandidatewindoww, l overallthreepossiblerelations: r∈{overlap,part,cont}. Thisamountsto11featuresin total. Weformulatethefollowingscorefunction E(ααα,X,W,l)=(cid:104)ααα,Φ(X,W,l)(cid:105) (6) where(cid:104)·,·(cid:105)isthescalarproduct. Objectlocalizationtranslatestosolving lˆ=argmax E(ααα,X,W,l)overallcandidatewindowsinthetestimage.Thevectorofscalars l ααα parametrizesthescorefunctionbyweightingthehyper-features(possiblywithanegative weight). We show how to efficiently maximize E over l in sec. 4.1 and how to learn ααα in sec.4.2. 4.1 Fastinference Wecanmaximizethecompletescore(6)overlsimplybyevaluatingitforallpossibleland pickingthebestone. Themostcomputationallyexpensivepartisthehyper-featureφ (3). G Foragivenl,itsumsoverallpairsofcandidatewindows,whichrequiresO(N2)subtractions andmultiplications. Thus,anaivemaximizationoveralllcostsO(N3). Tosimplifythenotation,herewewritethescorefunctionwithonlyoneargumentE(l), astheotherargumentsααα,X,W arefixedduringmaximization. Notethat0≤|ρr(w,w)− i l ρr(w ,w)|≤1,therefore j l (cid:26) 0 ,αr ≤0 (cid:27) αr ·|ρr(w,w)−ρr(w ,w)|·K(x,x )≤ G (7) G i l j l i j αrK(x,x ) ,αr >0 G i j G Whereαr istheweightofhyper-featureφr. Bysubstitutingtheelementsinthesumover G G pairs in (3) with the bound (7), we obtain an upper bound on the φr term of the score (in G factthreebounds, oneforeachr). WecanthenobtainanupperboundE˜(l)≥E(l)onthe fullscorebycomputingallotherhyper-featuresandaddingthemtotheboundsonφr. This G upperboundE˜(l)isfasttocompute,as(7)onlydependsonappearancefeaturesX,notonl, andcomputingtheotherhyper-featuresislinearinl. WeusetheboundE˜ inanearlyrejectionalgorithm.Weformaqueuebysortingwindows indescendingorderofE˜(l). WethenevaluatethefullscoreE(l)ofthefirstl inthequeue and store it as the current maximum. We then go to the next l in the queue. If its upper boundE˜(l)issmallerthanthecurrentmaximum, thenwediscarditwithoutcomputingits fullscore. Otherwise,wecomputeE(l)andsetitasthecurrentmaximumifitisbetterthan thepreviousbestone. Weiterativelygothroughthequeueandattheendreturnthecurrent maximum(whichisnowthebestoveralll). Notice,thattheproposedfastinferencemethod isexact: itoutputsthesamesolutionasbruteforceevaluationofallpossiblel. VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW 7 4.2 Learningα withstructuredoutputregression The scoring function (6) is parametrized by the weight vector ααα. We learn an ααα from the trainingdataofeachclassusingastructuredoutputregressionformulation[6, 19, 20, 33]. Ideally,welookforααα sothat,foreachtrainingimageI,thecandidatewindowl∗ thatbest I overlaps with the ground-truth bounding-box has the highest score. It is also good to en- courage the score difference between the best window l∗ and any other window w to be I l proportionaltotheiroverlap. Thismakesthelearningproblemsmootherandbetterbehaved thanwhenusinganaive0/1losswhichequallypenalizesallwindowsotherthanl∗. Hence, I weusetheloss∆(l,l(cid:48))=1−ρoverlap(wl,wl(cid:48))proposedby[6], whichformalizesthisintu- ition. Wecanfindααα bysolvingthefollowingoptimizationproblem 1 min ||ααα||2+γ∑(ξ ) I ααα,ξ 2 I S.T. ξI ≥0,∀I:(cid:104)ααα,φ(XI,lI∗)(cid:105)−(cid:104)ααα,φ(XI,l)(cid:105)≥∆(l,lI∗)−ξI,∀I,∀l∈LI\lI∗ (8) whereI indexesoveralltrainingimages. Thisisaconvexoptimizationproblem,butithas hundred of thousands of constraints (i.e. about 1000 candidate windows for each training image,timesabout500trainingimagesperclassinourexperiments). Wesolveitefficiently usingquadraticprogrammingwithconstraintgeneration[5]. Thisinvolvesfindingthemost violatedconstraintforacurrentααα. Wedothisexactlyaswecansolvetheinferenceproblem (6)andtheloss∆decomposesintoasumoftermswhichdependonasinglewindow.Thanks tothis,theconstraintgenerationprocedurewillfindtheglobaloptimumof(8)[5]. 5 Related work The first work to try to annotate object locations in ImageNet [15] addressed it as a win- dowclassificationproblem,whereaclassifieristrainedonannotatedimagesisthenusedto classifywindowsinimageswithnoannotation. Later[35]proposedtoregresstheoverlap of a window with the object using GP [27], which allowed for self-assessment thanks to GPprobabilisticoutput. Webuildon[35],usingAssociativeEmbeddingtolearnthekernel betweenwindowsappearancefeaturesandGPtopredictthespatialrelationsbetweenawin- dowandtheobject. Notehowthemodelof[35]isequivalenttoourswhenusingonlytheφ S hyper-feature. Importantly, both [15, 35], as well as many other object localization techniques [8, 14, 32, 34, 36], score each window individually based only on its own appearance. Our work goesbeyondbyevaluatingwindowsbasedonrichercuesmeasuredoutsidethewindow.This isrelatedtopreviousworkoncontext[7,10,17,18,25,26,30]aswellastoworksthatuse structuredoutputregressionformulation[6,19,20,33]. Wereviewtheseareasbelow. Context. TheseminalworkofTorralba[30]hasshownthatglobalimagedescriptorssuch as GIST give a valuable cue about which classes might be present in an image (e.g. in- door scenes are likely to have TVs, but unlikely to have cars). Since then, many object detectors [2, 17, 25, 28, 29, 32] have employed such global context to re-score their de- tections, thereby removing out-of-context false-positives. Some of these works also incor- porate the region surrounding the object into the window classifier to leverage local con- text[8,22,24,32]. Other works [7, 10, 18, 26] model context as the interactions between multiple object classes in the same image. Rabinovich et al. [26] use local detectors to first assign a class 8 VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW label to each segment in the image and then adjusts these labels by taking into account co-occurrence between classes. Heitz and Koller [18] exploits context provided by "stuff" (backgroundclasseslikeroad)toguidethelocalizationof"things"(objectslikecars). Sev- eral works [7, 10] model the co-occurrence and spatial relations between object classes in thetrainingdataandusethemtopost-processtheoutputofindividualobjectclassdetectors. Anextensiveempiricalstudyofdifferentcontext-basedmethodscanfoundin[11]. Theforcedrivingthoseworksisthesemanticandspatialstructureofscenesasarrange- ments of different object classes in particular positions. Instead, our technique works on a different level, improving object localization for a single class by integrating cues from the appearance and spatial relations of all windows in an image. It can be seen as a new, complementaryformofcontext. Localizationwithstructuredoutputregression wasfirstproposedby[6]. Theydevised atrainingstrategythatspecificallyoptimizeslocalizationaccuracy,bytakingintoaccountthe overlapoftrainingimagewindowswiththeground-truth. Theytrytolearnafunctionwhich scores windows with high overlap with ground-truth higher than those with low overlap. Theapproachwasextendedby[33]toincludelatentvariablesforhandlingmultipleaspects of appearance and truncated object instances. At test time an efficient branch-and-bound algorithmisusedtofindthewindowwiththemaximumscore. Branch-and-boundmethods forlocalizationwherefurtherexploredin[19,20]. Importantly, the scoring function in [6, 19, 20, 33] still scores each window in the test imageindependently. Inourworkinsteadwescoreeachwindowinthecontextofallother windows in the image, taking into account their similarity in appearance space as well as their spatial relations in the image plane (sec. 4). We also use structured output regres- sion [31] for learning the parameters of our scoring function (sec. 4.2). However, due to interactionbetweenallwindowsinthetestimage,ourmaximizationproblemismorecom- plexthanin[6,19],makingtheirbranch-and-boundmethodinapplicable.Instead,wedevise anearly-rejectionmethodthatusestheparticularstructureofourscoringfunctiontoreduce thenumberofevaluationsofitsmostexpensiveterms(sec.4.1). 6 Experiments and conclusions WeperformexperimentsonthesubsetofImageNet[9]definedby[15,16,35],whichcon- sists of 219 classes for a total of 92K images with ground-truth bounding-boxes. Follow- ing [35], we split them in two disjoint subsets of 60K and 32K for training and testing respectively. Theclassesareverydiverseandincludeanimalsaswellasman-madeobjects (fig. 5). The task is to localize the object of interest in images known to contain a given class[15,16,35]. Wetrainaseparatemodelforeachclassusingthecorrespondingimages fromthetrainingset. Features. For our method and all the baselines we use the same features as AE-GP+ method from [35]: (i) three ultra-dense SIFT bag-of-words histograms on different color spaces [32] (each 36000 dimensions); (ii) a SURF bag-of-word from [35] (17000 dimen- sions); (iii) HOG [8] (2048 dimensions). We embed each feature type separately in a 10- dimensionalAE[35]space. Next,weconcatenatethemtogetherandaddlocationandscale featuresasin[15,35]. Intotal,thisleadstoa54-dimensionalspaceonwhichtheGPoper- ates,i.e. only54parameterstolearnfortheGPkernel. VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW 9 0.9 Thispaper 0.85 AE−GP 0.8 MKL−SVM UVA p0.75 a erl 0.7 v O0.65 0.6 0.55 0.5 0 0.2 0.4 0.6 0.8 1 Fractionoftop−scoredboxes Figure4: Meanoverlapcurves. Weretainthesingletopscoringcandidatewindowinatestimage andmeasurethemeanoverlapoftheoutputwindowswiththeground-truth. Wevaryathresholdon thescoreoftheoutputwindowstogenerateperformancecurves.Thehigherthecurve,thebetter. 6.1 Baselinesandcompetitors MKL-SVM. This represents a standard, classifier driven approach to object localization, similar to [32, 34]. On 90% of the training set we train a separate SVM for each group of features described above. We combine these classifiers by training a linear SVM over their outputs on the remaining 10% of the data. We also include the location and scale featuresofawindowinthissecond-levelSVM.Thisbaselineusesexactlythesamecandidate windows[23]andfeaturesasourmethod(sec.6). UVA [32]. The popular method [32] can be seen as a smaller version of the MKL-SVM baselinewehavejustdefined. Inordertomakeamoreexactcomparisonto[32],weremove the additional features and use only their three SIFT bag-of-words. Moreover, instead of training a second level SVM, we simply combine their outputs by averaging. This corre- spondsto[32],butusingtherecentstate-of-the-artobjectproposals[23]insteadofselective searchproposals.Thismethod[32]isoneofthebestperformingobjectdetectors.Ithaswon theILSVRC2011[1]detectionchallengeandthePASCALVOC2012detectionchallenge. AE-GP[35]. Finally,wecomparetotheAE-GP+modelof[35].Itcorrespondstoadegen- erateversionofourmodelwhichusesonlyφ toscoreeachwindowinisolationbylooking S atitsownfeatures(5). Thisusesthesamecandidatewindows[23]andfeaturesweuse. This techniquewasshowntooutperformearlierworkonlocationannotationinImageNet[15]. 6.2 Results For each method we retain the single top scoring candidate window in a test image. We measurelocalizationaccuracyasthemeanoverlapoftheoutputwindowswiththeground- truth [35] (fig. 4). We vary a threshold on the score of the output windows to generate performancecurves. Thehigherthecurve,thebetter. Asfig.4shows,theproposedmethodconsistentlyoutperformsthecompetitorsandthe baselineoverthewholerangeofthecurve. Ourmethodachieves0.75meanoverlapwhen returningannotationsfor35%ofallimages.Atthesameaccuracylevel,AE-GP,MKL-SVM andUVAreturn28%,9%and4%imagesrespectively,i.e.wecanreturn7%moreannotation atthishighlevelofoverlapthantheclosestcompetitor. Producingveryaccuratebounding- box annotations is important for their intended use as training data for various models and tasks. ImprovingoverAE-GPvalidatestheproposedideaofscoringcandidatewindowsby 10 VEZHNEVETSANDFERRARI:LOOKINGOUTOFTHEWINDOW Tiger Hartebeest Skunk Barrow Guenon Wild boar Buckle Pool cue Racket Figure5: Qualitativeresults. Resultsofourmethod(red)vsAE-GP[35](green). Notice,howour methodisabletodetectsmall,off-centerobjectsdespiteocclusion(poolcue)ortheobjectblending withitssurroundings(tiger). taking into account spatial relations to all other windows and their appearance similarity. The favourable comparison to the excellent method UVA [32] and to its extended MKL- SVMversiondemonstratesthatoursystemofferscompetitiveperformance.Exampleobjects localizedbyourmethodandbyAE-GPareshowninfig.5.Ourmethodsuccessfullyoperates inclutteredimages(guenon,barrow,skunk). Itcanfindcamouflagedanimals(tiger),small objects (buckle, racket), and deal with diverse classes and high intra-class variation (pool cue,buckle,racket). Toevaluatetheimpactofourfastinferencealgorithm(sec.4.1)wecomparedittobrute force (i.e. evaluating the energy for all possible configurations) on the baseball class. On averagebruteforcetakes17.6sperimage,whereasourfastinferencetakes0.14s. Sinceour inferencemethodisexact,itproducesthesamesolutionasbruteforce,but124×faster. 6.3 Conclusion WehavepresentedanewmethodforannotatingthelocationofobjectsinImageNet,which goesbeyondconsideringonecandidatewindowatatime. Instead,itscoreseachwindowin thecontextofallotherwindowsintheimage,takingintoaccounttheirsimilarityinappear- ancespaceaswellastheirspatialrelationsintheimageplane. Aswehavedemonstratedon 92KimagesfromImageNet,ourmethodimprovesoversomeofthebestperformingobject localizationtechniques[32,35],includingtheonewebuildon[35]. Acknowledgment ThisworkwassupportedbytheERCStartingGrantVisCul. A.Vezh- nevetswasalsosupportedbySNSFfellowshipPBEZP-2142889. References [1] Imagenet large scale visual recognition challenge (ILSVRC). http://www.image- net.org/challenges/LSVRC/2011/index,2011. [2] University of amsterdam and euvision technologies at ilsvrc2013 (ILSVRC). http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013-UvA- Euvision-web.pdf,2013.