Deep Region Hashing for Efficient Large-scale Instance Search from Images

arXiv:1701.07901v1 [cs.CV] 26 Jan 2017

Jingkuan Song, Columbia University, jingkuan.song@gmail.com
Tao He, University of Electronic Science and Technology of China, tao.he@uestc.edu.cn
Lianli Gao, University of Electronic Science and Technology of China, lianli.gao@uestc.edu.cn
Xing Xu, University of Electronic Science and Technology of China, xing.xu@uestc.edu.cn
Heng Tao Shen, University of Electronic Science and Technology of China, shenhengtao@hotmail.com

Abstract

Instance search (INS) is a fundamental problem for many applications, and it is more challenging than traditional image search since relevancy is defined at the instance level. Existing works have demonstrated the success of many complex ensemble systems that typically first generate object proposals and then extract handcrafted and/or CNN features of each proposal for matching. However, object bounding box proposal and feature extraction are often conducted in two separate steps, which limits the effectiveness of these methods. Also, due to the large number of generated proposals, matching speed becomes the bottleneck that limits their application to large-scale datasets. To tackle these issues, in this paper we propose an effective and efficient Deep Region Hashing (DRH) approach for large-scale INS using an image patch as the query. Specifically, DRH is an end-to-end deep neural network which consists of object proposal, feature extraction, and hash code generation. DRH shares the full-image convolutional feature map with the region proposal network, thus enabling nearly cost-free region proposals. Also, each high-dimensional, real-valued region feature is mapped onto a low-dimensional, compact binary code for efficient object-region-level matching on large-scale datasets. Experimental results on four datasets show that our DRH can achieve even better performance than the state-of-the-art in terms of mAP, while the efficiency is improved by nearly 100 times.

1. Introduction

Large-scale instance search (INS) is to efficiently retrieve the images containing a specific instance from a large-scale image dataset, given a query image of that instance. It has long been a hot research topic due to its many applications such as image classification and detection [33, 35]. Early research [20] usually extracts image-level handcrafted features such as color, shape and texture to serve a user query. Recently, due to the development of deep neural networks, using the outputs of the last fully-connected network layers as global image descriptors to support image classification and retrieval has demonstrated an advantage over the prior state-of-the-art [18]. More recently, we have witnessed the research attention shift from features extracted from the fully-connected layers to deep convolutional layers [1], and it turns out that convolutional layer activations have superior performance in image retrieval. However, using image-level features may fail to find instances in images. As illustrated in Fig. 1, sometimes a target region is only a small portion of a database image. Directly comparing the query image with a whole database image may result in an inaccurate match. Therefore, it is necessary to consider local region information.

Recently, many successful INS approaches [29, 17] were proposed by combining fast filtering with more computationally expensive spatial verification and query expansion. More specifically, all images in a database are first ranked according to their distances to the query, then spatial verification is applied to re-rank the search results, and finally query expansion is applied to improve the search performance. An intuitive way to perform spatial verification is the sliding window strategy: it generates sliding windows at different scales and aspect ratios over an image, and each window is then compared to the query instance in order to find the optimal location that contains the query. A more advanced strategy is to use object proposal algorithms [32, 7], which evaluate many image locations and determine whether they contain the object or not. Therefore, the query instance only needs to be compared with the object proposals instead of all the image locations.
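To make the cost of exhaustive region-level matching concrete, the following sketch performs the naive comparison with real-valued descriptors; all sizes and the random data are illustrative, not the paper's setup. Scoring one query touches every region feature, i.e., on the order of N×M×d operations:

```python
import numpy as np

# Toy exhaustive region matching: N images, M candidate regions each,
# d-dimensional real-valued descriptors (illustrative sizes only).
rng = np.random.default_rng(0)
N, M, d = 100, 40, 512
regions = rng.standard_normal((N, M, d))               # region descriptors per image
query = regions[7, 3] + 0.01 * rng.standard_normal(d)  # query close to one planted region

# One query touches every region feature: ~N*M*d multiply-adds.
dists = np.linalg.norm(regions - query, axis=2)        # (N, M) distance matrix
best_image, best_region = np.unravel_index(dists.argmin(), dists.shape)
print(best_image, best_region)  # recovers the planted region: 7 3
```

Binary codes compared by Hamming distance, as proposed in this paper, replace exactly this dense floating-point scan.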
Existing works [29, 30] have demonstrated the success of many complex ensemble systems that first generate object bounding box proposals and then extract handcrafted and/or CNN features for each bounding box. Because region detection and feature extraction are conducted in two separate steps, the detected regions may not be optimal for feature extraction, leading to suboptimal results.

Figure 1. Purely using global hash codes to calculate the similarity between a query instance and database images is not accurate enough. (a) is a query instance with a deep hash code representation H_q; (c) is a database image with a deep global region hash code H_g obtained by taking the whole image as a region; (b) is a region pooled from (c) with a deep region hash code H_l. It is clear that conducting INS using only the global region information of (c) is inaccurate, and deep region hashing should be considered as well. Thus, in this paper, we aim to learn a deep region hashing to support fast and efficient INS.

Faster R-CNN [26] introduces a region proposal network which integrates feature extraction and region proposal in a single network. The detection network shares the full-image convolutional features, thus enabling nearly cost-free region proposals. Inspired by this, if a neural network can simultaneously extract the convolutional features and propose object candidates, we can perform instance search on these candidates. However, efficiency is a big issue: both selective search and the RPN still generate hundreds to thousands of candidate regions. If a dataset has N images, each image contains M regions, and each region is represented by a d-dimensional feature vector, then an exhaustive search for each query requires N×M×d operations to measure the distances to the candidate regions. Therefore, despite taking advantage of deep convolutional layer activations and region proposals, the expensive computational cost of nearest neighbor search in high-dimensional data limits the application of these methods to large-scale datasets.

In this paper we propose a fast and effective Deep Region Hashing (DRH) approach for visual instance search in large-scale datasets with an image patch as the query. Specifically, DRH is an end-to-end deep neural network which consists of layers for feature extraction, object proposal, and hash code generation. DRH can generate nearly cost-free region proposals because the region proposal network shares the full-image convolutional feature map. Also, the hash code generation layer learns binary codes to represent large-scale images and regions to accelerate INS. It is worthwhile to highlight the following aspects of DRH:

• We propose an effective and efficient end-to-end Deep Region Hashing (DRH) approach, which consists of object proposal, feature extraction, and hash code generation, for large-scale INS with an image patch as the query. Given an image, DRH can automatically generate the hash codes for the whole image and for the object candidate regions.

• We design different strategies for object proposals based on the convolutional feature map. The region proposal is nearly cost-free, and we also integrate a hashing strategy into our approach to enable efficient INS on large-scale datasets.

• Extensive experimental results on four datasets show that our generated hash codes can achieve even better performance than the state-of-the-art using real-valued features in terms of mAP, while the efficiency is improved by nearly 100 times.

2. Related Work

Our work is closely related to instance retrieval, object proposals and hashing. We briefly review the related work for each of them.

Instance retrieval. Most approaches to INS rely on the bag-of-words (BoW) model [23, 20, 17], which is one of the most effective content-based approaches for large-scale image and visual object retrieval. Typically, they are based on local features such as RootSIFT [29], and few are based on CNN features. For instance, [17] proposes to conduct INS by encoding the convolutional features of a CNN using a BoW aggregation scheme. More recently, research has shown that VLAD [11] and Fisher vectors [19] perform better than BoW [29, 12, 30]. Specifically, BoW encodes a local representation by indexing the nearest visual word, while VLAD and the Fisher vector quantize the local descriptor with an extra residual vector obtained by subtracting the mean of the visual word or a Gaussian component fit to all observations [29, 30]. However, these embedding methods have drawbacks, e.g., inhibiting true positive matches between local features, involving a large number of parameters, and suffering from overfitting [30].

Several works go beyond handcrafted features for image retrieval and INS. [13, 5] trained deep CNNs and adopted features extracted from the fully-connected layers to enhance image retrieval.
Figure 2. The overview of DRH. It contains two parts, i.e., the deep region hashing network part and the instance search part. The first part trains a network for feature extraction, object proposal and hash code generation, while the second part utilizes DRH to get the query results.
Recently, research attention has shifted from extracting deep fully-connected features to deep convolutional features [25, 36, 4]. For instance, [1] aggregates local deep features to produce compact global descriptors for image retrieval. [25] provides an extensive study on the suitability of image representations based on convolutional networks for INS. In addition, [2] demonstrates that the activations invoked by an image within the top layers of a large convolutional neural network provide a high-level descriptor of the visual content of the image; also, PCA-compressed neural codes outperform compact descriptors computed on handcrafted features such as SIFT. Furthermore, [27] explores the suitability for INS of image- and region-wise representations pooled from a trained or fine-tuned Faster R-CNN with supervised information, combining re-ranking and query expansions.

Object proposals. Object proposals have received much attention recently. Proposal methods can be broadly divided into two categories: superpixel-based methods (e.g., Selective Search [32] and Multiscale Combinatorial Grouping [22]) and sliding window based approaches (e.g., EdgeBoxes and Faster R-CNN [26]). More specifically, Selective Search adopts the image structure to guide the sampling process and captures all possible locations by greedily merging superpixels based on engineered low-level features. Due to its good performance, it has been widely used as an initial step for detection, such as in R-CNN [7], and for INS tasks [29]. Even though MCG achieves the best accuracy, it takes about 50 seconds per image to generate region proposals. By contrast, Faster R-CNN proposes a region proposal network, which takes an image as input, shares a common set of convolutional layers with the object detection network, and finally outputs a set of rectangular object proposals, each with an objectness score. This region proposal network is trained on the PASCAL VOC dataset and achieves high accuracy. A comprehensive survey and evaluation protocol for object proposal methods can be found in [3].

Hashing. Existing hashing methods can be classified into two categories: unsupervised hashing [34, 6] and supervised hashing [34, 15, 28, 6, 37]. A comprehensive survey and comparison of hashing methods is presented in [35]. CNNs have achieved great success in various tasks such as image classification, retrieval, tracking and object detection [3, 4, 2, 5, 18, 36], yet there are few hashing methods that adopt deep models. For instance, Zhao et al. [37] proposed a deep semantic ranking based method for learning hash functions that preserve multilevel semantic similarity between multi-label images. Lin et al. [14] propose an effective semantics-preserving hashing method, which transforms the semantic affinities of training data into a probability distribution. To date, most deep hashing methods are supervised, leveraging supervised information (e.g., class labels) to learn compact bitwise representations, while few methods focus on the unsupervised setting, which uses unlabeled data to learn a set of hash functions. Moreover, existing deep unsupervised hashing methods focus on global image representations without considering the locality information which we believe to be important for INS.
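A practical note on why binary codes are attractive for matching: with bit-packed codes, the Hamming distance used throughout the hashing literature reduces to an XOR and a popcount. A minimal sketch (the packing scheme and the 16-bit toy codes are illustrative):

```python
import numpy as np

def hamming(packed_a, packed_b):
    """Hamming distance between binary codes packed into uint8 arrays."""
    return int(np.unpackbits(np.bitwise_xor(packed_a, packed_b)).sum())

# Two toy 16-bit codes that differ in exactly 3 bit positions.
a = np.packbits(np.array([1,0,1,1,0,0,1,0, 1,1,0,0,1,0,1,1], dtype=np.uint8))
b = np.packbits(np.array([1,0,0,1,0,0,1,0, 1,1,0,1,1,0,1,0], dtype=np.uint8))
print(hamming(a, b))  # 3
```

A 1024-bit code occupies 128 bytes and is compared with a handful of integer operations, versus thousands of floating-point operations for a real-valued descriptor of the same dimensionality.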
3. Methodology

This paper explores instance search from images using hash codes of image regions detected by an object detection CNN or a sliding window. In our setup, query instances are defined by a bounding box over the query image. The framework of our DRH-based instance search is shown in Fig. 2, and it has two phases, i.e., an offline DRH training phase and an online instance search phase. In this section, we describe these two phases in detail.

3.1. Deep Region Hashing

As shown in Fig. 2, the DRH training phase consists of three components, i.e., CNN-based feature map generation, region proposal generation and region hash code generation. Next, we illustrate each of them.

3.1.1 CNN-based Representations

As mentioned above, we focus on leveraging convolutional networks for hash learning. We adopt the well-known architecture of [5] as our basic framework. As shown in Fig. 2, our network has 5 groups of convolution layers, 4 max convolution-pooling layers, 1 Region of Interest (RoI) pooling layer and 1 hashing layer. Following [5], we use 64, 128, 256, 512 and 512 filters in the 5 groups of convolutional layers, respectively. More specifically, our network first processes an arbitrary image with several convolutional (conv) and max pooling layers to produce a conv feature map. Next, for each region proposal, a region of interest (RoI) pooling layer extracts a fixed-length (i.e., 512-dimensional) feature from the feature map. Each pooled feature is then fed into the region hash layer, which finally outputs the hash code representing the corresponding RoI.

3.1.2 Region Pooling Layer

There are two ways to generate region proposals: 1) a sliding window (where the number of region proposals is determined by a sliding window overlapping parameter λ), and 2) the Region Proposal Network (RPN) proposed by Faster R-CNN [26]. The latter provides a set of object proposals, and we use them as our region proposals by assuming that the query image always contains some objects.

• Sliding window. Given a raw image of size W×H, our network first generates a conv feature map of size W_c×H_c, where W, W_c and H, H_c are the widths and heights of the original image and the conv map, respectively. Next, we choose windows of all possible combinations of widths {W_c, W_c/2, W_c/3} and heights {H_c, H_c/2, H_c/3}. We use a sliding window strategy directly on the conv map with overlap in both directions, where the percentage of overlap is measured by an overlapping parameter λ. Next, we utilize a simple filtering strategy to discard windows: if W/H ≤ th, we discard the scale W_c/3; if H/W ≤ th, we discard H_c/3; otherwise we keep the original settings. When the width and height are set to W_c and H_c, we obtain a global region image descriptor, which will be used as the input for global region hash code generation.

• RPN. Faster R-CNN [26] proposed a Region Proposal Network (RPN), which takes the conv feature map as input and outputs a set of rectangular object proposals, each with an objectness score. Given a conv feature map of size W_c×H_c, it generates W_c×H_c×9 (typically 21,600) regions. Specifically, the RPN proposed by Faster R-CNN is able to propose object regions and extract the convolutional activations for each of them. Therefore, for each region proposal, it can compose a descriptor by aggregating the activations of that window in the RoI pooling layer, giving rise to region-wise descriptors. In order to conduct image-level operations, we obtain a global region by setting the whole conv map as a region. To conduct instance-level search, we obtain local regions which are smaller than the global region.

After region proposal generation, using either the sliding window based approach or the RPN approach, for each image we obtain a set of regions which includes one global region and several local regions. Next, we apply max pooling to the corresponding region window on the conv map to generate the RoI feature for each region.

3.1.3 Region Hashing Layer

For instance search, directly using RoI features to compare the query feature with all the RoI features is effective, and it should work well on small-scale datasets. However, on large-scale datasets this approach is computationally expensive and requires large memory, since it involves huge vector operations. In this paper, we aim to improve both INS accuracy and efficiency. Inspired by the success of hashing in terms of search accuracy, search time cost and space cost, we propose a region hashing layer to generate hash codes for the region proposals.

Here, we convert each RoI feature to a binary hash code. Given a RoI feature x_r ∈ R^{1×512}, the latent region hash layer output is h_r = σ(W_r x_r + b_r), where W_r is the hash function and b_r is the bias. Next, we binarize the output h_r ∈ R^{1×L_r} to obtain a binary code y_r = sgn(σ(W_r x_r + b_r)) ∈ {0,1}^{L_r}. To obtain a qualified binary hash code, we propose the following optimization problem:

min_{W_r, b_r} ℓ = (1/2) g(y_r, h_r) − (α/2) t(h_r) + (β/2) r(W_r) + (η/2) o(b_r)    (1)

where g(y_r, h_r) is the penalty function that minimizes the loss between the relaxed output h_r and the binary code y_r; t(h_r), r(W_r) and o(b_r) are regularization terms for h_r, W_r and b_r; and α, β and η are trade-off parameters.
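Before the individual terms are specified, the forward pass just described, a sigmoid projection followed by thresholding (reading sgn over the sigmoid output as a 0.5 threshold), can be sketched in NumPy; the random initialization, the code length and the input here are purely illustrative, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
Lr, d = 32, 512                             # illustrative code length; d=512 per the paper
W_r = 0.01 * rng.standard_normal((Lr, d))   # hash projection (illustrative random init)
b_r = np.zeros(Lr)                          # bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_r = rng.standard_normal(d)                # a pooled RoI feature
h_r = sigmoid(W_r @ x_r + b_r)              # relaxed (real-valued) hash layer output
y_r = (h_r > 0.5).astype(np.uint8)          # binary code in {0,1}^Lr
print(y_r.shape, set(y_r.tolist()) <= {0, 1})  # (32,) True
```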
We define g(y_r, h_r) and t(h_r) as below:

g(y_r, h_r) = ‖y_r − h_r‖_F^2
t(h_r) = tr(h_r h_r^T)    (2)

The regularization terms r(W_r) and o(b_r) are defined as:

r(W_r) = ‖W_r‖_F^2
o(b_r) = ‖b_r‖_F^2    (3)

To solve this optimization problem, we employ stochastic gradient descent to learn the parameters W_r and b_r. The gradients of the objective function in Eq. (1) with respect to the different parameters are computed as follows:

∂ℓ/∂W_r = ((h_r − y_r − α h_r) ⊙ σ′(W_r x_r + b_r)) x_r^T + β W_r
∂ℓ/∂b_r = (h_r − y_r − α h_r) ⊙ σ′(W_r x_r + b_r) + η b_r    (4)

where ⊙ is element-wise multiplication. The parameters W_r and b_r are updated with the gradient descent algorithm until convergence.

3.1.4 Network Training

We train our model using stochastic gradient descent with momentum, with back-propagation used to compute the gradient of the objective function ∇Loss(W_r, b_r) with respect to all parameters W_r and b_r. More specifically, we randomly initialize the RoI pooling layer and the region hashing layer by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. The shared convolutional layers are initialized by pre-training a model [24]. Next, we tune only the RoI layer and the hashing layer to conserve memory. In particular, for RPN training we fix the convolutional layers and only update the RPN layer parameters. The PASCAL VOC 2007 dataset is used to train the RPN, and for this study we choose the top 300 regions as input to the region hash layer. For sliding window based region proposals, we use the different scales to directly extract regions and then input them into the region hash layer. At this stage, the parameters of the conv layers and the region pooling layer are obtained. Next, we fix the parameters of the conv layers and the region pooling layer to train the region hash layer. The learning rate is set to 0.01 in the experiments, and the parameters α, β and η were empirically set to 100, 0.001 and 0.001, respectively.

3.2. Instance Search using DRH

This section describes the INS pipeline built on the DRH approach. The pipeline consists of a Global Region Hashing (gDRH) search, a Local Region Hashing (lDRH) re-ranking, and two Region Hashing Query Expansions (QE).

Global Region Hashing Search (gDRH). This step computes the similarity between the hash code of a query and the global region hash codes of the dataset images. For each item in the dataset, we use the hash code generated from the whole image region. The image list is then sorted by the Hamming distance of the dataset items to the query. After gDRH, we choose the top M images to form the first ranking list X^rank = {x_1^rank, ..., x_M^rank}.

Local Region Hashing Re-search (lDRH). After obtaining X^rank, we perform the local region hashing search by computing the Hamming distance between the query instance hash code and the hash codes of each local region in the top M images. (The hashing query expansions introduced below are optional.) For each image, the distance score is the minimal Hamming distance between the hash code of the query instance and all the local region hash codes within that image. Finally, the images are re-ordered based on these distances, giving the final query list X^rank′ = {x_1^rank′, ..., x_M^rank′}.

Region Hashing Query Expansions (QE). In this study, we further investigate two query expansion strategies based on the global and local hash codes.

• Global Region Hashing Query Expansion (gQE). After the global region search, we conduct the global hashing query expansion, which is applied after gDRH and before lDRH. Specifically, we first choose the global hash codes of the top q images within X^rank and calculate the similarity between each of them and each global hash code of the X^rank images. For x_i^rank, the similarity score is the maximum similarity between x_i^rank and the top q images' global region hash codes. Next, X^rank is re-ordered into X^rankq = {x_1^rankq, ..., x_M^rankq}, and finally we conduct lDRH on the X^rankq list.

• Local Region Hashing Query Expansion (lQE). To conduct local region hashing query expansion, we initially need to conduct gDRH and lDRH to get X^rank′ (here gQE is optional). Next, we take the top q best-matched region hash codes from the images within the X^rank′ list and compare these q region hash codes with all the regions of X^rank′. We choose the maximum value as the similarity score for each image, and finally X^rank′ is re-ranked based on those similarity scores.

4. Experiment

We evaluate our algorithm on the task of instance search. First, we study the influence of the different components of our framework. Then, we compare DRH with state-of-the-art algorithms in terms of efficiency and effectiveness on two standard datasets.
4.1. Datasets and Evaluation Metric

We consider two publicly available datasets that have been widely used in previous work. 1) Oxford Buildings [20]. This dataset contains two sub-datasets: Oxford 5k, which contains 5,062 high-resolution (1024×768) images, including 11 different landmarks (i.e., particular parts of buildings), each represented by several possible queries; and Oxford 105k, which combines Oxford 5k with 100,000 distractors to allow for an evaluation of scalability. 2) Paris Buildings [21], which also contains two sub-datasets: Paris 6k, including 6,300 high-resolution (1024×768) images of Paris landmarks; and Paris 106k, which is generated by adding 100,000 Flickr distractor images to the Paris 6k dataset. Compared with Oxford Buildings, it contains images of buildings with some similarities in architectural style. The hashing layer is trained on a dataset composed of the Oxford 5k and Paris 6k training examples.

For each dataset, we follow the standard procedure to evaluate retrieval performance. In this paper, we use mean Average Precision (mAP) as the evaluation metric.

4.2. Performance using Different Settings

Several different settings affect the performance of our algorithm. To comprehensively study the performance of DRH, we examine its different components, including: 1) the hash code length L; 2) the region proposal component and its parameters; and 3) query expansion and its settings.

4.2.1 The Effect of Hash Code Length L

First, we investigate the influence of the hash code length by using gDRH to conduct INS. We report the mAP of gDRH with different code lengths L, and also the mAP of INS using max pooling on the conv5 feature. The results are shown in Tab. 1.

Table 1. The performance (mAP) variance with different code lengths L using gDRH.

Method | Oxford 5k | Paris 6k
gDRH (128 bits) | 0.425 | 0.531
gDRH (256 bits) | 0.583 | 0.629
gDRH (512 bits) | 0.668 | 0.724
gDRH (1024 bits) | 0.748 | 0.773
gDRH (4096 bits) | 0.783 | 0.815
Max conv5 feature (512-D) | 0.797 | 0.824

These results show that as the hash code length L increases from 128 to 4096, the performance of gDRH increases from 42.5% to 78.3% on the Oxford 5k dataset, while the mAP increases from 53.1% to 81.5% on the Paris 6k dataset. Comparing L = 1024 to L = 4096, the code length increases by a factor of 4, but the performance is only improved by 3.5% and 4.2%. Therefore, to balance search efficiency and effectiveness, we use L = 1024 as the default setting. On the other hand, the best performance of gDRH (i.e., L = 4096) is slightly lower than that of using max pooling on conv5 features, with decreases of 1.4% and 0.9%, respectively. This is due to the information loss of hash codes.

4.2.2 The Effect of Region Proposals

In this sub-experiment, we first explore the two ways of generating region proposals: the sliding window based and the RPN based approaches. Both can generate different numbers of object proposals. For the RPN, we choose 300 regions to generate region hash codes for lDRH. For the sliding window, we tune λ ∈ {0.4, 0.5, 0.6, 0.7} to get different numbers of object proposals. We further study the mAP for different numbers of top M results. gDRH produces the initial retrieval results, and in this subsection we refine them using lDRH, i.e., gDRH+lDRH. The reported results (Tab. 2) are on the Oxford 5k dataset. From these results, we can make several observations.

• Compared with RPN based DRH, the sliding window based DRH performs slightly better. It also requires fewer region boxes and does not need the training of an RPN, so we adopt sliding window based DRH as the default approach in the following experiments. The RPN is supposed to generate better region proposals than a sliding window, but it achieves worse performance in this experiment. One potential reason is that the RPN is trained on the PASCAL VOC dataset, which does not make it robust for proposing 'building' objects. On the other hand, the objects in Oxford Buildings are usually very large, so they can be readily captured by sliding windows.

• For sliding window based DRH, λ affects the performance. In general, the performance is not very sensitive to λ. The best performance is achieved when λ = 0.6, which generates only about 40 regions per image. In the following experiments, the default setting for λ is 0.6.

• The number of top M results has no significant impact on the mAP of DRH. In general, the performance increases with M. When M = 800, the algorithm achieves the best results; when M = 400, the performance is 0.3% lower than with M = 800. Considering the memory cost, we choose M = 400 for the following experiments.

4.2.3 The Effect of Query Expansions

In this subsection, we study the effect of the query expansions. This study aims to explore the effect of lDRH, the effect of q for gQE, and the effect of gQE and lQE for DRH, as well as several combinations. The experimental results are shown in Tab. 3, from which we make the observations below:
Figure 3. Qualitative results of DRH search on the Oxford 5k dataset. Seven queries (i.e., radcliffe_camera, all_souls, ashmolean, balliol, bodleian, christ_church and hertford) and the corresponding search results are shown, one per row. The query is shown on the left, with a selection of the top 9 ranked retrieved images shown on the right.
Table 2. The influence of top M, λ, and the way of region generation. This experiment is conducted on the Oxford 5k dataset with L = 1024.

gDRH+lDRH | Sliding window based DRH, λ=0.4 (~21 boxes) | λ=0.5 (~36 boxes) | λ=0.6 (~40 boxes) | λ=0.7 (~60 boxes) | RPN-DRH (~300 boxes)
M=100 | 0.773 | 0.774 | 0.776 | 0.772 | 0.772
M=200 | 0.779 | 0.780 | 0.781 | 0.778 | 0.778
M=400 | 0.779 | 0.781 | 0.783 | 0.780 | 0.780
M=800 | 0.782 | 0.784 | 0.786 | 0.783 | 0.783

Table 3. The effect of query expansions (gQE and lQE).

Settings | gDRH | gQE | lDRH | lQE | Oxford 5k | Paris 6k
No QE | yes | no | no | no | 0.745 | 0.773
No QE | yes | no | yes | no | 0.783 | 0.801
gQE (different q) | yes | yes, q=4 | no | no | 0.813 | 0.823
gQE (different q) | yes | yes, q=5 | no | no | 0.809 | 0.828
gQE (different q) | yes | yes, q=6 | no | no | 0.815 | 0.835
gQE (different q) | yes | yes, q=7 | no | no | 0.789 | 0.842
gQE+lQE (q=6) | yes | yes | no | no | 0.815 | 0.835
gQE+lQE (q=6) | yes | yes | yes | no | 0.804 | 0.831
gQE+lQE (q=6) | yes | no | yes | yes | 0.833 | 0.819
gQE+lQE (q=6) | yes | yes | yes | yes | 0.851 | 0.849
gQE+lQE (q=7) | yes | yes | no | no | 0.789 | 0.842
gQE+lQE (q=7) | yes | yes | yes | no | 0.804 | 0.834
gQE+lQE (q=7) | yes | no | yes | yes | 0.826 | 0.821
gQE+lQE (q=7) | yes | yes | yes | yes | 0.838 | 0.854

• lDRH improves INS. In terms of mAP, it yields increases of 3.8% and 2.8% on the Oxford 5k and Paris 6k datasets, respectively.

• gQE improves the performance of DRH. With q = 6 it performs best (i.e., 85.1%) on the Oxford 5k dataset, and with q = 7 it obtains the best performance on the Paris 6k dataset. In the following experiments, q is set to 6.

• The experimental results show that by combining gQE and lQE, DRH performs best (i.e., 85.1% on Oxford 5k and 85.4% on Paris 6k). Therefore, the query expansion strategies improve INS.

4.3. Comparing with State-of-the-art Methods

To evaluate the performance of DRH, we compare our method with state-of-the-art deep feature based methods. These methods can be divided into four categories:

• Feature encoding approaches [8, 9, 18]. These aim to reduce the number of vectors against which the query is compared. [9] designed a dictionary learning approach for global descriptors obtained from both SIFT and CNN features. [18] extracted convolutional features from different layers of the networks and adopted VLAD encoding to encode the features into a single vector per image.

• Aggregated deep features [25, 1, 31, 12, 17, 27, 29, 30].
Table4.Comparisontostate-of-the-artCNNrepresentation. Table5.Thetimecost(ms)fordifferentalgorithms
Oxford Oxford Paris Paris CNNfeature[31] DRH DRH
Methods Datasets
5K 105K 6K 106K 512-Dconv5 512-Dhash 1024-Dhash
CNN Iscenetal.-Keans[8] 0.656 0.612 0.797 0.757
Oxford105K 1078 3 12
Feature Iscenetal.-Rand[8] 0.251 0.437 0.212 0.444
Encod- Ngetal.[18] 0.649 - 0.694 - Paris106K 1137 3 12
ing Iscenetal.[9] 0.737 0.655 0.853 0.789
Razavianetal.[25] 0.556 - 0.697 - improved by integrating re-ranking and query expan-
Bebenkoetal.[1] 0.657 0.642 - - sion strategies. On the other hand, hashing methods
CNN Toliasetal.[31] 0.669 0.616 0.830 0.757
performstheworstintermsofmAP.Thisisprobably
Featues Kalantidisetal.[12] 0.684 0.637 0.765 0.691
duetotheinformationlossofhashcodes.
Aggre- Mohedanoetal.[17] 0.739 0.593 0.820 0.648
gation Salvadoretal.[27] 0.710 - 0.798 -
30]. The works of [25, 1, 31, 12] focus on aggregating deep convolutional features for image or instance retrieval. In [17], Mohedano et al. propose a simple instance retrieval pipeline based on encoding the convolutional features of a CNN using the bag-of-words (BoW) aggregation scheme. [27] explores the suitability of image- and region-wise descriptors pooled from Faster R-CNN for instance retrieval. The works of [29, 30] focus on generic instance search.

• Integrating with Locality and QE. Some of the previous methods are extended by integrating re-ranking and/or query expansion techniques [12, 31, 17, 27].

• Hashing methods. Lots of hashing methods have been proposed for image retrieval [10, 16].

The comparison results are shown in Tab. 4, and we also show some qualitative results in Fig. 3.

Tab. 4: Comparison with state-of-the-art methods on the Oxford and Paris datasets.

Group                              Method                               Oxford 5k  Oxford 105k  Paris 6k  Paris 106k
                                   Tao et al. [29]                      0.765      -            -         -
                                   Tao et al. [30]                      0.722      -            -         -
Integration with Locality and QE   Kalantidis et al. + GQE [12]         0.749      0.706        0.848     0.794
                                   Tolias et al. + AML + QE [31]        0.773      0.732        0.865     0.798
                                   Mohedano et al. [17] + Rerank + LQE  0.788      0.651        0.848     0.641
                                   Salvador et al. [27] + CS-SR + QE    0.786      -            0.842     -
Hashing                            Jégou et al. [10]                    0.503      -            0.491     -
                                   Liu et al. [16]                      0.518      -            0.511     -
Ours                               DRH (gDRH + lDRH)                    0.783      0.754        0.801     0.733
                                   DRH All                              0.851      0.825        0.849     0.802

From these results, we have the following observations.

• First, our approach performs the best on both large-scale datasets, Oxford 105k and Paris 106k, with 0.825 and 0.802, respectively. In addition, the performance on Oxford 105k is higher than that of Tolias et al. + AML + QE [31], with a 9.3% increase.

• On the Paris 6k dataset, our performance is slightly lower than that of Tolias et al. + AML + QE [31] by 1.1%, but on the Oxford 5k dataset, it outperforms Mohedano et al. + Rerank + LQE [17] by 6.3%.

• The performance of CNN feature aggregation is better than raw CNN features in general, and it can be further improved by re-ranking and query expansion.

• The qualitative results show that DRH can retrieve instances precisely, even in cases where the query image is different from the whole target image but similar to a small region of it.

4.4. Efficiency Study

In this subsection, we compare the efficiency of our DRH to the state-of-the-art algorithms. The time cost for instance search usually consists of two parts: filtering and re-ranking. For our DRH, the re-ranking is conducted using M = 400 candidates, each of which has around 40 local regions. Therefore, the time cost for re-ranking can be ignored compared with the filtering step. The comparing algorithms have different re-ranking strategies, and some of them do not have this step at all. Therefore, we only report the time for the filtering step of these comparing algorithms.

We choose [31] as the representative non-hashing algorithm. To make a fair comparison, we implement the linear scan of the conv5 features [31] and of our hash codes using Python and C++, respectively, without any accelerating strategies. All the experiments are done on a server with an Intel Core i7-6700K CPU @ 4.00GHz × 8; the graphics card is a GeForce GTX TITAN X/PCIe/SSE2. The search time is the average number of milliseconds to return the top M = 400 results for a query, and the time costs are reported in Tab. 5.

These results show that directly using max-pooling features extracted from conv5 to conduct INS search is much slower than using our hash codes. It takes 1078 ms on the Oxford 105k dataset and 1137 ms on the Paris 106k dataset for [31]. By contrast, when the code length is 512, our DRH is more than 300 times faster than [31], taking only 3 ms on both the Oxford 105k and Paris 106k datasets. Even when the code length increases to 1024, DRH is still nearly 100 times faster than [31]. Therefore, our method has the ability to work on large-scale datasets.

5. Conclusion

In this work, we propose an unsupervised deep region hashing (DRH) method, a fast and efficient approach for large-scale INS with an image patch as the query input. We firstly utilize the deep conv feature maps to extract nearly 40 local regions per image, which are hashed to generate hash codes to represent each image. Next, the
gDRH search, gQE, lDRH and lQE are applied to further improve the INS search performance. Experiments on four datasets demonstrate the superiority of our DRH compared to others in terms of both efficiency and effectiveness.

References

[1] A. Babenko and V. S. Lempitsky. Aggregating deep convolutional features for image retrieval. CoRR, abs/1510.07493, 2015.
[2] A. Babenko, A. Slesarev, A. Chigorin, and V. S. Lempitsky. Neural codes for image retrieval. CoRR, abs/1404.1777, 2014.
[3] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra. Object-proposal evaluation protocol is 'gameable'. In CVPR, 2016.
[4] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In CVPR, 2015.
[5] A. Conneau, H. Schwenk, L. Barrault, and Y. LeCun. Very deep convolutional networks for natural language processing. CoRR, abs/1606.01781, 2016.
[6] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[8] A. Iscen, T. Furon, V. Gripon, M. G. Rabbat, and H. Jégou. Memory vectors for similarity search in high-dimensional spaces. CoRR, abs/1412.3328, 2014.
[9] A. Iscen, M. Rabbat, and T. Furon. Efficient large-scale similarity search using matrix factorization. In CVPR, 2016.
[10] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, pages 304–317, 2008.
[11] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[12] Y. Kalantidis, C. Mellina, and S. Osindero. Cross-dimensional weighting for aggregated deep convolutional features. CoRR, abs/1512.04065, 2015.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[14] Z. Lin, G. Ding, M. Hu, and J. Wang. Semantics-preserving hashing for cross-view retrieval. In CVPR, 2015.
[15] X. Liu, J. He, B. Lang, and S. F. Chang. Hash bit selection: A unified solution for selection problems in hashing. In CVPR, pages 1570–1577, 2013.
[16] Z. Liu, H. Li, W. Zhou, R. Zhao, and Q. Tian. Contextual hashing for large-scale image search. IEEE Trans. Image Processing, 23(4):1606–1614, 2014.
[17] E. Mohedano, A. Salvador, K. McGuinness, F. Marqués, N. E. O'Connor, and X. Giró-i-Nieto. Bags of local convolutional features for scalable instance search. CoRR, abs/1604.04653, 2016.
[18] J. Y. Ng, F. Yang, and L. S. Davis. Exploiting local features from deep networks for image retrieval. CoRR, abs/1504.05133, 2015.
[19] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In CVPR, 2010.
[20] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[21] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
[22] J. Pont-Tuset, P. Arbeláez, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[23] D. Qin, C. Wengert, and L. Van Gool. Query adaptive similarity for large scale object retrieval. In CVPR, pages 1610–1617, 2013.
[24] F. Radenović, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. CoRR, abs/1604.02426, 2016.
[25] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Visual instance retrieval with deep convolutional networks. CoRR, abs/1412.6574, 2014.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[27] A. Salvador, X. Giró-i-Nieto, F. Marques, and S. Satoh. Faster R-CNN features for instance search. In CVPR Workshops, 2016.
[28] D. Song, W. Liu, R. Ji, D. A. Meyer, and J. R. Smith. Top rank supervised binary coding for visual search. In ICCV, 2015.
[29] R. Tao, E. Gavves, C. G. M. Snoek, and A. W. M. Smeulders. Locality in generic instance search from one example. In CVPR, 2014.
[30] R. Tao, A. W. M. Smeulders, and S. F. Chang. Attributes and categories for generic instance search from one example. In CVPR, 2015.
[31] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of CNN activations. CoRR, abs/1511.05879, 2015.
[32] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.
[33] A. Vedaldi and A. Zisserman. Sparse kernel approximations for efficient classification and detection. In CVPR, pages 2320–2327, 2012.
[34] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2393–2406, 2012.
[35] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen. A survey on learning to hash. CoRR, abs/1606.00185, 2016.
[36] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.
[37] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, 2015.
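Appendix: the filtering comparison in Sec. 4.4 (a linear scan over float conv5 descriptors versus a scan over binary hash codes) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the database size, the 512-bit code length, the NumPy-based Hamming computation, and all variable names are assumptions chosen for the sketch; the speed gap in the paper additionally comes from its C++ implementation and the smaller memory footprint of the codes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_db, dim, n_bits = 10_000, 512, 512  # toy database; the paper filters ~105k images

# Float descriptors (e.g. max-pooled conv5 features), L2-normalized.
feats = rng.standard_normal((n_db, dim)).astype(np.float32)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# Binary codes packed 8 bits per byte: 512 bits -> 64 bytes per image,
# versus 2048 bytes for a 512-d float32 descriptor.
codes = np.packbits(rng.integers(0, 2, (n_db, n_bits)).astype(np.uint8), axis=1)

def filter_linear_scan(query_feat, top_m=400):
    # Baseline: cosine similarity via a matrix-vector product over floats.
    sims = feats @ query_feat
    return np.argsort(-sims)[:top_m]

def filter_hamming(query_code, top_m=400):
    # Hash-code filtering: XOR then popcount over the packed bytes.
    dists = np.unpackbits(codes ^ query_code, axis=1).sum(axis=1)
    return np.argsort(dists)[:top_m]

# Use database item 0 as its own query in both representations.
cand_f = filter_linear_scan(feats[0])
cand_h = filter_hamming(codes[0])
assert cand_f.shape == (400,) and cand_h.shape == (400,)
assert cand_f[0] == 0 and cand_h[0] == 0  # the query itself ranks first
```

Both functions return the top M = 400 candidate indices for re-ranking; the Hamming scan touches 32× less memory per comparison, which is the main source of the filtering speed-up the paper reports.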