Object localization in ImageNet by looking out of the window

Alexander Vezhnevets (vezhnick@gmail.com)
Vittorio Ferrari (vferrari@gmail.com)
University of Edinburgh, Edinburgh, Scotland, UK

© 2015. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

Abstract
We propose a method for annotating the location of objects in ImageNet. Traditionally, this is cast as an image window classification problem, where each window is considered independently and scored based on its appearance alone. Instead, we propose a method which scores each candidate window in the context of all other windows in the image, taking into account their similarity in appearance space as well as their spatial relations in the image plane. We devise a fast and exact procedure to optimize our scoring function over all candidate windows in an image, and we learn its parameters using structured output regression. We demonstrate on 92000 images from ImageNet that this significantly improves localization over recent techniques that score windows in isolation [32, 35].
1 Introduction

The ImageNet database [9] contains over 14 million images annotated by the class label of the main object they contain. However, only a fraction of them (10%) have bounding-box annotations. Automatically annotating object locations in ImageNet is a challenging problem, which has recently drawn attention [15, 16, 35]. These annotations could be used as training data for problems such as object class detection [8], tracking [21] and pose estimation [4]. Traditionally, object localization is cast as an image window scoring problem, where a scoring function is trained on images with bounding-boxes and applied to ones without. The image is first decomposed into candidate windows, typically by object proposal generation [3, 12, 23, 32]. Each window is then scored by a classifier trained to discriminate instances of the class from other windows [8, 14, 15, 17, 32, 34, 36] or by a regressor trained to predict its overlap with the object [6, 20, 33, 35]. Highly scored windows are finally deemed to contain the object. In this paradigm, the classifier looks at one window at a time, making a decision based only on that window's appearance.

We believe there is more information in the collection of windows in an image. By taking into account the appearance of all windows at the same time and connecting it to their spatial relations in the image plane, we could go beyond what can be done by looking at one window at a time. Consider the baseball in fig. 1(a). For a traditional method to succeed, the appearance classifier needs to score the window on the baseball higher than the windows containing it. The container windows cannot help except by scoring lower and being discarded.
Figure 1: Connecting the appearance and window position spaces. (a) A window tight on the baseball (green star in the appearance space plot) and some larger windows containing it (red circles in the appearance space). Black points in appearance space represent all other candidate windows. (b) All windows with high overlap with the wolf are shown in red, both in the image and in appearance space. The ground-truth bounding-box to be found is shown in green. The appearance space plots are actual data points, representing windows in a 3-dimensional Associative Embedding of SURF bag-of-words. Please see the main text for discussion.
By considering one window at a time with a classifier that only tries to predict whether it covers the object tightly, one cannot do much more than that. The first key element of our work is to predict richer spatial relations between each candidate window and the object to be detected, including part and container relations. The second key element is to employ these predictions to reason about relations between different windows. In this example, the container windows are predicted to contain a smaller target object somewhere inside them, and thereby actively help by reinforcing the score of the baseball window. Fig. 1(b) illustrates another example of the benefits of analyzing all windows jointly. Several windows which have high overlap with each other and with the wolf form a dense cluster in appearance space, making it hard to discriminate the precise bounding-box by its appearance alone. However, the precise bounding-box is positioned at an extreme point of the cluster (on its tip). By considering the configuration of all the windows in appearance space together we can reinforce its score.
In a nutshell, we propose to localize objects in ImageNet by scoring each candidate window in the context of all other windows in the image, taking into account their similarity in appearance space as well as their spatial relations in the image plane. To represent spatial relations of windows we propose a descriptor indicative of the part/container relationship of the two windows and of how well aligned they are (sec. 2). We learn a window appearance similarity kernel using the recent Associative Embedding technique [35] (sec. 3). We describe each window with a set of hyper-features connecting the appearance similarity and spatial relations of that window to all other windows in the same image. These hyper-features are indicative of the object's presence when the appearance of a window alone is not enough (e.g. fig. 1). The hyper-features are then linearly combined into an overall scoring function (sec. 4). We devise a fast and exact procedure to optimize our scoring function over all candidate windows in a test image (sec. 4.1), and we learn its parameters using structured output regression [31] (sec. 4.2).

We evaluate our method on a subset of ImageNet containing 219 classes with more than 92000 images [15, 16, 35]. The experiments show that our method outperforms a recent approach for this task [35], an MKL-SVM baseline [34] based on the same features, and the popular UVA object detector [32]. The remainder of the paper is organized as follows. Sec. 2 and 3 introduce the spatial relation descriptors which we use in sec. 4 to define our new localization model. In sec. 5 we review related work. Experiments and conclusions are presented in sec. 6.
Figure 2: Spatial relations ρ(w, w′) between two windows. (a) The first element indicates how much w and w′ overlap, in the traditional PASCAL VOC sense [13]. The second element indicates whether window w is a part of w′. The third element measures whether w is a container of w′. (b) Computing the spatial relations ρ(w_i, w_l) and ρ(w_j, w_l) for hyper-features φ_G (3) and φ_L (4).
2 Describing spatial relations between windows
Candidate windows. Recently, object class detectors have been moving away from the sliding-window paradigm and operate instead on a relatively small collection of candidate windows [3, 12, 14, 15, 23, 32, 35, 36] (also called 'object proposals'). The candidate window generators are designed to respond to objects of any class, and typically just 1000-2000 candidates are sufficient to cover all objects in a cluttered image [3, 12, 23, 32]. Given a test image, the object localization task is then to select a candidate window covering an instance of a particular class (e.g. cars). Following this trend, we generate about 1000 candidate windows W = {w_i}_{i=1}^N using the recent method [23].
Spatial relation descriptor. We introduce here a representation of the spatial relations between two windows w and w′, which we later use in our localization model (sec. 4). We summarize the spatial relation between windows w and w′ using the following spatial relation descriptor (fig. 2(a))

    ρ(w, w′) = ( (w ∩ w′)/(w ∪ w′),  (w ∩ w′)/w,  (w ∩ w′)/w′ )    (1)

where the ∩ operator denotes the area of the intersection between the windows, and ∪ the area of their union. The descriptor captures three different kinds of spatial relations. The first is the familiar intersection-over-union (overlap), which is often used to quantify the accuracy of an object detector [1, 13]. It is 1 when w = w′, and decays rapidly with the misalignment between the two windows. The second relation measures how much of w is contained inside w′. It is high when w is a part of w′, e.g. when w′ is a car and w is a wheel. The third relation measures how much of w′ is contained inside w. It is high when w contains w′, e.g. w′ is a snooker ball and w is a snooker table. All three relations are 0 if w and w′ are disjoint and 1 if w and w′ match perfectly. Hence the descriptor is indicative of part/container relationships between the two windows and of how well aligned they are.
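To make the descriptor concrete, here is a minimal Python sketch (boxes as (x1, y1, x2, y2) tuples; the function name and box convention are our own, not from the paper's implementation):

```python
import numpy as np

def spatial_relations(w, w_prime):
    """Spatial relation descriptor rho(w, w') of eq. (1).

    Boxes are (x1, y1, x2, y2) tuples. Returns the triple
    (overlap, part, cont): intersection-over-union, fraction of w
    lying inside w', and fraction of w' lying inside w.
    """
    ix1, iy1 = max(w[0], w_prime[0]), max(w[1], w_prime[1])
    ix2, iy2 = min(w[2], w_prime[2]), min(w[3], w_prime[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_w = (w[2] - w[0]) * (w[3] - w[1])
    area_wp = (w_prime[2] - w_prime[0]) * (w_prime[3] - w_prime[1])
    union = area_w + area_wp - inter
    return np.array([inter / union, inter / area_w, inter / area_wp])

# w is the top-left quadrant of w': a part-like relation.
print(spatial_relations((0, 0, 5, 5), (0, 0, 10, 10)))  # [0.25 1. 0.25]
```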
Vector field of window relations. Relative to a particular candidate window w_l, we can compute the spatial relation descriptor to any window w. This induces a vector field ρ(·, w_l) over the continuous space of all possible window positions. We observe the field only at the discrete set of candidate windows W. A key element of our work is to connect this field of spatial relations to measurements of appearance similarity between windows. This connection between position and appearance spaces drives the new components in our localization model (sec. 4).
3 Predicting spatial relations with the object

A particularly interesting case is when w′ is the true bounding-box w* of an object. For the images in the training set, we know the spatial relations ρ(w, w*) between all candidate windows w and the bounding-box w*. We can use them to learn to predict the spatial relation between candidate windows and the object from window appearance features x in a test image, where ground-truth bounding-boxes are not given.

Following [35], we use Gaussian Process regression (GP) [27] to learn to predict a probability distribution P(ρ^r(w, w*) | x) ∼ GP(m(x), K(x, x′)) for each spatial relation r ∈ {overlap, part, cont} given window appearance features x. We use zero mean m(x) = 0 and learn the kernel (covariance function) K(x, x′) as in [35]. This kernel plays the role of an appearance similarity measure between two windows. The GP learns kernel parameters so that the resulting appearance similarity correlates with the spatial relation to be predicted, i.e. so that two windows which have a high kernel value also have a similar spatial relation to the ground-truth. We will use the learnt K(x, x′) later in sec. 4.
For a window w_i in a test image, the GP predicts a Gaussian distribution for each relation in the descriptor. We denote the means of these predictive distributions as μ(x_i) = (μ^overlap(x_i), μ^part(x_i), μ^cont(x_i)), and their standard deviation as σ(x_i). The standard deviation is the same for all relations, as we use the same kernel and inducing points.
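As a rough illustration of this setup only, the sketch below fits a stock scikit-learn GP with an RBF kernel on synthetic data as a stand-in for the learnt AE kernel of [35]; the feature dimensionality, target layout and shared standard deviation mirror the paper, everything else is illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 54))   # 54-dim window features (sec. 6)
Y_train = rng.uniform(size=(200, 3))   # rho^r(w, w*) for the 3 relations

# One GP with a shared kernel and multi-output targets: the predictive
# mean gives mu^r(x) per relation; the predictive std is shared across
# relations because the kernel and training inputs are the same.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(X_train, Y_train)

X_test = rng.normal(size=(5, 54))
mu, sigma = gp.predict(X_test, return_std=True)
print(mu.shape)  # (5, 3): one predicted mean per relation per window
```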
4 Object localization with spatial relations

We are given a test image with (a) a set of candidate windows W = {w_i}_{i=1}^N; (b) their appearance features X = {x_i}_{i=1}^N; (c) the means M = {μ(x_i)}_{i=1}^N and standard deviations Σ = {σ(x_i)}_{i=1}^N of their spatial relations with the object bounding-box, as predicted by the GP; (d) the appearance similarity kernel K(x_i, x_j) (sec. 3).

Let w_l ∈ W be a candidate window to be scored. We proceed by defining a set of hyper-features Φ(X, W, M, l) characterizing w_l, and then define our scoring function through them.
Consistency of predicted & induced spatial relations φ_C

    φ_C^r(X, W, l) = max_i |ρ^r(w_i, w_l) − μ^r(x_i)|    (2)
Assume for a moment that w_l correctly localizes an instance of the object class. Selecting w_l would induce spatial relations ρ^r(w_i, w_l) to all other windows w_i. The hyper-feature φ_C checks whether these induced spatial relations are consistent with those predicted by the GP based on the appearance of the other windows (μ^r(x_i)). If so, that is a good sign that w_l is indeed the correct location of the object. More precisely, the hyper-feature measures the disagreement between the induced ρ^r(w_i, w_l) and the predicted μ^r(x_i) at the window w_i with the largest disagreement. Fig. 3 illustrates this on a toy example. The maximum disagreement, instead of a more intuitive mean, is less influenced by disagreement over background windows, which are usually predicted by the GP to have small, but non-zero, relations to the object. It focuses better on the alignment of the peaks of the predicted {μ^r(x_i)}_{i=1}^N and observed {ρ^r(w_i, w_l)}_{i=1}^N measurements of the vector field ρ^r(·, w_l), which is more indicative of w_l being a correct localization.
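A minimal sketch of eq. (2), checked against the toy numbers of fig. 3 (the (N, 3) array layout is our own convention):

```python
import numpy as np

def phi_C(rho_l, mu):
    """Consistency hyper-feature of eq. (2): the disagreement between
    induced and GP-predicted relations at the most disagreeing window.

    rho_l: (N, 3) induced relations rho^r(w_i, w_l), r = (overlap, part, cont).
    mu:    (N, 3) GP-predicted relations mu^r(x_i).
    """
    return np.abs(rho_l - mu).max(axis=0)

# Toy example of fig. 3: three windows w_l, w_1, w_2.
mu    = np.array([[0.8, 0.8, 0.9], [0.5, 0.3, 0.9], [0.4, 0.3, 0.8]])
rho_l = np.array([[1.0, 1.0, 1.0], [0.2, 0.2, 1.0], [0.3, 0.3, 1.0]])
print(phi_C(rho_l, mu))  # [0.3 0.2 0.2], as in fig. 3
```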
Figure 3: The hyper-feature φ_C for a window w_l is computed by finding the maximum disagreement between the spatial relations predicted by the GP, μ^r(x_i), and those induced by the spatial relations ρ(w_i, w_l), over all other windows w_i in the image. The figure illustrates this on a toy example with three windows w_l, w_1, w_2. For each r ∈ {overlap, part, cont}, the pair of μ^r(x_i) and ρ(w_i, w_l) with maximum disagreement is highlighted in red.
Global spatial relations & appearance φ_G

    φ_G^r(X, W, l) = 2/(N² − N) · Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} |ρ^r(w_i, w_l) − ρ^r(w_j, w_l)| · K(x_i, x_j)    (3)
This hyper-feature reacts to pairs of candidate windows (w_i, w_j) with similar appearance (high K(x_i, x_j)) but different spatial relations to w_l. Two windows w_i, w_j contribute significantly to the sum if they look similar (high K(x_i, x_j)) and their spatial relations to w_l are different (high |ρ^r(w_i, w_l) − ρ^r(w_j, w_l)|).

A large value of φ_G indicates that the vector field of the spatial relations to w_l is not smooth with respect to appearance similarity. This indicates that w_l has a special role in the structure of spatial and appearance relations within that image. By measuring this pattern, φ_G helps the localization algorithm to select a better window when the information contained in the appearance features of w_l alone is not enough. For example, a window w_l tightly covering a small object such as the baseball in fig. 1(a) has high φ_G^cont, because other windows containing it often look similar to windows not containing it. In this case, a high value of φ_G is a positive indication for w_l being a correct localization. On the other hand, a window w_l tightly covering the wolf in fig. 1(b) has low φ_G^overlap, because windows that overlap with it are all similar to each other in appearance space. In this case, this low value is a positive indication for w_l being correct. In which direction to use this hyper-feature is left to the learning of its weight in the full scoring function, which is separate for each object class (sec. 4.2).
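A direct O(N²) sketch of eq. (3) under the same array conventions as above, with K the precomputed (N, N) kernel matrix; this pairwise sum is exactly the expensive term that the fast inference of sec. 4.1 bounds:

```python
import numpy as np

def phi_G(rho_l, K):
    """Global hyper-feature of eq. (3), one value per relation r.

    rho_l: (N, 3) induced relations rho^r(w_i, w_l) for all windows w_i.
    K:     (N, N) appearance similarity kernel K(x_i, x_j).
    Averages |rho^r(w_i, w_l) - rho^r(w_j, w_l)| * K(x_i, x_j) over all
    unordered pairs i < j, i.e. with normalizer 2 / (N^2 - N).
    """
    N = rho_l.shape[0]
    diff = np.abs(rho_l[:, None, :] - rho_l[None, :, :])  # (N, N, 3)
    iu = np.triu_indices(N, k=1)                          # pairs i < j
    return 2.0 * (diff[iu] * K[iu][:, None]).sum(axis=0) / (N * N - N)
```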
Local spatial relations & appearance φ_L

    φ_L^r(X, W, l) = 1/N · Σ_{i=1}^{N} |1 − ρ^r(w_i, w_l)| · K(x_i, x_l)    (4)

This hyper-feature is analogous to φ_G, but focuses around w_l in appearance space. It is indicative of whether windows that look similar to w_l (high K(x_i, x_l)) are also similar in position in the image, i.e. whether their spatial relation ρ^r(w_i, w_l) to w_l is close to 1.
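And its local counterpart, eq. (4), with k_l denoting the kernel column K(x_i, x_l) (again our own array convention):

```python
import numpy as np

def phi_L(rho_l, k_l):
    """Local hyper-feature of eq. (4), one value per relation r.

    rho_l: (N, 3) induced relations rho^r(w_i, w_l).
    k_l:   (N,) appearance similarities K(x_i, x_l) to window w_l.
    High values flag similar-looking windows whose spatial relation
    to w_l is far from 1 (i.e. far from w_l in the image plane).
    """
    return (np.abs(1.0 - rho_l) * k_l[:, None]).mean(axis=0)
```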
Window classifier score φ_S. The last hyper-feature is the score of a classifier which predicts whether a window w_l covers an instance of the class, based only on its appearance features x_l. Standard approaches to object localization typically consider only this cue [8, 14, 15, 17, 32, 34, 36]. In principle, we could use any such method as the score φ_S here. In practice, we reuse the GP prediction of the overlap of a window with the object bounding-box as φ_S. One possibility would be to simply use the mean predicted overlap μ^overlap(x_l). However, as shown in [35], it is beneficial to take into account the uncertainty of the predicted overlap, which is also provided by the GP as the standard deviation σ(x_l) of the estimate

    φ_S(X, l) = [μ^overlap(x_l), σ(x_l)]    (5)

Using this hyper-feature alone would correspond to the method of [35].
Complete score. Let

    Φ(X, W, l) = [{φ_G^r(X, W, l)}_r, {φ_L^r(X, W, l)}_r, {φ_C^r(X, W, l)}_r, φ_S(X, l)]

be the concatenation of all hyper-features defined above for a particular candidate window w_l, over all three possible relations r ∈ {overlap, part, cont}. This amounts to 11 features in total. We formulate the following score function

    E(α, X, W, l) = ⟨α, Φ(X, W, l)⟩    (6)

where ⟨·, ·⟩ is the scalar product. Object localization translates to solving l̂ = argmax_l E(α, X, W, l) over all candidate windows in the test image. The vector of scalars α parametrizes the score function by weighting the hyper-features (possibly with negative weights). We show how to efficiently maximize E over l in sec. 4.1 and how to learn α in sec. 4.2.
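Assembling the pieces, a sketch of Φ and the score (6), reusing the phi_* functions sketched above; the 3 + 3 + 3 + 2 layout matches the 11 features:

```python
import numpy as np

def score(alpha, rho_l, mu, K, l, phi_S_l):
    """Score E(alpha, X, W, l) of eq. (6) for one candidate window.

    Concatenates phi_G, phi_L, phi_C (3 relations each) with the
    2-dimensional phi_S of eq. (5): 11 features, weighted by alpha.
    """
    Phi = np.concatenate([
        phi_G(rho_l, K),         # global relations & appearance
        phi_L(rho_l, K[:, l]),   # local relations & appearance
        phi_C(rho_l, mu),        # consistency with GP predictions
        phi_S_l,                 # [mu_overlap(x_l), sigma(x_l)]
    ])
    return float(alpha @ Phi)
```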
4.1 Fast inference

We can maximize the complete score (6) over l simply by evaluating it for all possible l and picking the best one. The most computationally expensive part is the hyper-feature φ_G (3). For a given l, it sums over all pairs of candidate windows, which requires O(N²) subtractions and multiplications. Thus, a naive maximization over all l costs O(N³).

To simplify the notation, here we write the score function with only one argument, E(l), as the other arguments α, X, W are fixed during maximization. Note that 0 ≤ |ρ^r(w_i, w_l) − ρ^r(w_j, w_l)| ≤ 1, therefore

    α_G^r · |ρ^r(w_i, w_l) − ρ^r(w_j, w_l)| · K(x_i, x_j) ≤ { 0 if α_G^r ≤ 0;  α_G^r · K(x_i, x_j) if α_G^r > 0 }    (7)

where α_G^r is the weight of hyper-feature φ_G^r. By substituting the elements in the sum over pairs in (3) with the bound (7), we obtain an upper bound on the φ_G^r term of the score (in fact three bounds, one for each r). We can then obtain an upper bound Ẽ(l) ≥ E(l) on the full score by computing all other hyper-features and adding them to the bounds on φ_G^r. This upper bound Ẽ(l) is fast to compute, as (7) only depends on the appearance features X, not on l, and computing the other hyper-features is linear in N.

We use the bound Ẽ in an early rejection algorithm. We form a queue by sorting windows in descending order of Ẽ(l). We evaluate the full score E(l) of the first l in the queue and store it as the current maximum. We then go to the next l in the queue. If its upper bound Ẽ(l) is smaller than the current maximum, we discard it without computing its full score. Otherwise, we compute E(l) and set it as the current maximum if it is better than the previous best one. We iteratively go through the queue and at the end return the current maximum (which is now the best over all l). Notice that the proposed fast inference method is exact: it outputs the same solution as brute-force evaluation of all possible l.
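A sketch of this early-rejection loop; E_full and E_bound stand for E(l) and Ẽ(l) and are assumed given as callables (the toy check at the end uses synthetic scores, not real data):

```python
import numpy as np

def maximize_score(E_full, E_bound, N):
    """Exact early-rejection maximization over candidates 0..N-1.

    E_full(l) is the O(N^2) score E(l); E_bound(l) the cheap upper
    bound E~(l) >= E(l). Visiting candidates in descending order of
    the bound lets us stop as soon as the current bound cannot beat
    the best full score found so far, without changing the result.
    """
    bounds = np.array([E_bound(l) for l in range(N)])
    queue = np.argsort(-bounds)              # descending E~(l)
    best_l, best_E = int(queue[0]), E_full(queue[0])
    for l in queue[1:]:
        if bounds[l] <= best_E:              # queue is sorted, so no
            break                            # later l can win either
        E = E_full(l)
        if E > best_E:
            best_l, best_E = int(l), E
    return best_l, best_E

# Toy check: the bound is the true score plus a nonnegative slack.
rng = np.random.default_rng(0)
s = rng.normal(size=100)
b = s + rng.uniform(0.0, 0.5, size=100)
l, E = maximize_score(lambda l: s[l], lambda l: b[l], 100)
assert l == int(np.argmax(s))
```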
4.2 Learning α with structured output regression

The scoring function (6) is parametrized by the weight vector α. We learn an α from the training data of each class using a structured output regression formulation [6, 19, 20, 33]. Ideally, we look for α such that, for each training image I, the candidate window l*_I that best overlaps with the ground-truth bounding-box has the highest score. It is also good to encourage the score difference between the best window l*_I and any other window w_l to be proportional to their overlap. This makes the learning problem smoother and better behaved than when using a naive 0/1 loss which equally penalizes all windows other than l*_I. Hence, we use the loss Δ(l, l′) = 1 − ρ^overlap(w_l, w_l′) proposed by [6], which formalizes this intuition. We can find α by solving the following optimization problem

    min_{α,ξ} ½ ‖α‖² + γ Σ_I ξ_I
    s.t. ξ_I ≥ 0; ⟨α, Φ(X_I, l*_I)⟩ − ⟨α, Φ(X_I, l)⟩ ≥ Δ(l, l*_I) − ξ_I, ∀I, ∀l ∈ L_I \ {l*_I}    (8)

where I indexes over all training images. This is a convex optimization problem, but it has hundreds of thousands of constraints (about 1000 candidate windows for each training image, times about 500 training images per class in our experiments). We solve it efficiently using quadratic programming with constraint generation [5]. This involves finding the most violated constraint for the current α. We can do this exactly, since we can solve the inference problem (6) and the loss Δ decomposes into a sum of terms which each depend on a single window. Thanks to this, the constraint generation procedure will find the global optimum of (8) [5].
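A sketch of the most-violated-constraint search inside constraint generation, assuming the hyper-features and losses of one image are precomputed into arrays (our own convention); the restricted QP re-solve itself, e.g. with an off-the-shelf QP solver, is omitted:

```python
import numpy as np

def most_violated(alpha, Phi, delta, l_star):
    """Most violated constraint of problem (8) for one training image.

    Phi:    (N, 11) hyper-features Phi(X_I, l) of every candidate l.
    delta:  (N,) losses Delta(l, l*_I) = 1 - rho_overlap(w_l, w_{l*}).
    l_star: index of the best-overlapping candidate window l*_I.
    Returns the candidate maximizing the loss-augmented score and its
    violation (a positive value means the constraint is violated).
    """
    slack = delta + Phi @ alpha - Phi[l_star] @ alpha
    slack[l_star] = -np.inf              # exclude l*_I itself
    l_hat = int(np.argmax(slack))
    return l_hat, slack[l_hat]
```

In each round of constraint generation, such violated constraints are added to the working set and the QP (8) is re-solved over it, until no constraint is violated beyond a small tolerance [5].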
5 Related work

The first work to annotate object locations in ImageNet [15] addressed it as a window classification problem, where a classifier trained on annotated images is then used to classify windows in images with no annotation. Later, [35] proposed to regress the overlap of a window with the object using a GP [27], which allowed for self-assessment thanks to the GP's probabilistic output. We build on [35], using Associative Embedding to learn the kernel between window appearance features and a GP to predict the spatial relations between a window and the object. Note how the model of [35] is equivalent to ours when using only the φ_S hyper-feature.

Importantly, both [15, 35], as well as many other object localization techniques [8, 14, 32, 34, 36], score each window individually based only on its own appearance. Our work goes beyond this by evaluating windows based on richer cues measured outside the window. This is related to previous work on context [7, 10, 17, 18, 25, 26, 30] as well as to works that use a structured output regression formulation [6, 19, 20, 33]. We review these areas below.
Context. The seminal work of Torralba [30] has shown that global image descriptors such as GIST give a valuable cue about which classes might be present in an image (e.g. indoor scenes are likely to have TVs, but unlikely to have cars). Since then, many object detectors [2, 17, 25, 28, 29, 32] have employed such global context to re-score their detections, thereby removing out-of-context false positives. Some of these works also incorporate the region surrounding the object into the window classifier to leverage local context [8, 22, 24, 32].

Other works [7, 10, 18, 26] model context as the interactions between multiple object classes in the same image. Rabinovich et al. [26] use local detectors to first assign a class label to each segment in the image and then adjust these labels by taking into account co-occurrence between classes. Heitz and Koller [18] exploit context provided by "stuff" (background classes like road) to guide the localization of "things" (objects like cars). Several works [7, 10] model the co-occurrence and spatial relations between object classes in the training data and use them to post-process the output of individual object class detectors. An extensive empirical study of different context-based methods can be found in [11].

The force driving those works is the semantic and spatial structure of scenes as arrangements of different object classes in particular positions. Instead, our technique works on a different level, improving object localization for a single class by integrating cues from the appearance and spatial relations of all windows in an image. It can be seen as a new, complementary form of context.
Localization with structured output regression was first proposed by [6]. They devised a training strategy that specifically optimizes localization accuracy, by taking into account the overlap of training image windows with the ground-truth. They try to learn a function which scores windows with high overlap with the ground-truth higher than those with low overlap. The approach was extended by [33] to include latent variables for handling multiple aspects of appearance and truncated object instances. At test time an efficient branch-and-bound algorithm is used to find the window with the maximum score. Branch-and-bound methods for localization were further explored in [19, 20].

Importantly, the scoring function in [6, 19, 20, 33] still scores each window in the test image independently. In our work, instead, we score each window in the context of all other windows in the image, taking into account their similarity in appearance space as well as their spatial relations in the image plane (sec. 4). We also use structured output regression [31] for learning the parameters of our scoring function (sec. 4.2). However, due to the interaction between all windows in the test image, our maximization problem is more complex than in [6, 19], making their branch-and-bound methods inapplicable. Instead, we devise an early-rejection method that uses the particular structure of our scoring function to reduce the number of evaluations of its most expensive terms (sec. 4.1).
6 Experiments and conclusions

We perform experiments on the subset of ImageNet [9] defined by [15, 16, 35], which consists of 219 classes for a total of 92K images with ground-truth bounding-boxes. Following [35], we split them into two disjoint subsets of 60K and 32K images for training and testing respectively. The classes are very diverse and include animals as well as man-made objects (fig. 5). The task is to localize the object of interest in images known to contain a given class [15, 16, 35]. We train a separate model for each class using the corresponding images from the training set.

Features. For our method and all the baselines we use the same features as the AE-GP+ method from [35]: (i) three ultra-dense SIFT bag-of-words histograms on different color spaces [32] (each 36000 dimensions); (ii) a SURF bag-of-words from [35] (17000 dimensions); (iii) HOG [8] (2048 dimensions). We embed each feature type separately in a 10-dimensional AE [35] space. Next, we concatenate them together and add location and scale features as in [15, 35]. In total, this leads to a 54-dimensional space on which the GP operates, i.e. only 54 parameters to learn for the GP kernel.
Figure 4: Mean overlap curves (overlap vs. fraction of top-scored boxes) for this paper, AE-GP, MKL-SVM and UVA. We retain the single top-scoring candidate window in a test image and measure the mean overlap of the output windows with the ground-truth. We vary a threshold on the score of the output windows to generate performance curves. The higher the curve, the better.
6.1 Baselines and competitors

MKL-SVM. This represents a standard, classifier-driven approach to object localization, similar to [32, 34]. On 90% of the training set we train a separate SVM for each group of features described above. We combine these classifiers by training a linear SVM over their outputs on the remaining 10% of the data. We also include the location and scale features of a window in this second-level SVM. This baseline uses exactly the same candidate windows [23] and features as our method (sec. 6).

UVA [32]. The popular method [32] can be seen as a smaller version of the MKL-SVM baseline we have just defined. In order to make a more exact comparison to [32], we remove the additional features and use only their three SIFT bag-of-words. Moreover, instead of training a second-level SVM, we simply combine their outputs by averaging. This corresponds to [32], but using the recent state-of-the-art object proposals [23] instead of selective search proposals. This method [32] is one of the best performing object detectors. It has won the ILSVRC 2011 [1] detection challenge and the PASCAL VOC 2012 detection challenge.

AE-GP [35]. Finally, we compare to the AE-GP+ model of [35]. It corresponds to a degenerate version of our model which uses only φ_S to score each window in isolation by looking at its own features (5). This uses the same candidate windows [23] and features we use. This technique was shown to outperform earlier work on location annotation in ImageNet [15].
6.2 Results

For each method we retain the single top-scoring candidate window in a test image. We measure localization accuracy as the mean overlap of the output windows with the ground-truth [35] (fig. 4). We vary a threshold on the score of the output windows to generate performance curves. The higher the curve, the better.

As fig. 4 shows, the proposed method consistently outperforms the competitors and the baseline over the whole range of the curve. Our method achieves 0.75 mean overlap when returning annotations for 35% of all images. At the same accuracy level, AE-GP, MKL-SVM and UVA return 28%, 9% and 4% of the images respectively, i.e. we can return 7% more annotations at this high level of overlap than the closest competitor. Producing very accurate bounding-box annotations is important for their intended use as training data for various models and tasks.
Figure 5: Qualitative results of our method (red) vs AE-GP [35] (green) on Tiger, Hartebeest, Skunk, Barrow, Guenon, Wild boar, Buckle, Pool cue and Racket. Notice how our method is able to detect small, off-center objects despite occlusion (pool cue) or the object blending with its surroundings (tiger).
Improving over AE-GP validates the proposed idea of scoring candidate windows by taking into account spatial relations to all other windows and their appearance similarity. The favourable comparison to the excellent method UVA [32] and to its extended MKL-SVM version demonstrates that our system offers competitive performance. Example objects localized by our method and by AE-GP are shown in fig. 5. Our method successfully operates in cluttered images (guenon, barrow, skunk). It can find camouflaged animals (tiger), small objects (buckle, racket), and deal with diverse classes and high intra-class variation (pool cue, buckle, racket).
To evaluate the impact of our fast inference algorithm (sec. 4.1) we compared it to brute force (i.e. evaluating the score for all possible configurations) on the baseball class. On average brute force takes 17.6s per image, whereas our fast inference takes 0.14s. Since our inference method is exact, it produces the same solution as brute force, but 124× faster.
6.3 Conclusion

We have presented a new method for annotating the location of objects in ImageNet, which goes beyond considering one candidate window at a time. Instead, it scores each window in the context of all other windows in the image, taking into account their similarity in appearance space as well as their spatial relations in the image plane. As we have demonstrated on 92K images from ImageNet, our method improves over some of the best performing object localization techniques [32, 35], including the one we build on [35].

Acknowledgements. This work was supported by the ERC Starting Grant VisCul. A. Vezhnevets was also supported by SNSF fellowship PBEZP-2142889.
References
[1] Imagenet large scale visual recognition challenge (ILSVRC). http://www.image-
net.org/challenges/LSVRC/2011/index,2011.
[2] University of amsterdam and euvision technologies at ilsvrc2013 (ILSVRC).
http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013-UvA-
Euvision-web.pdf,2013.