Relief R-CNN : Utilizing Convolutional Feature Interrelationship for Fast Object Detection Deployment GuiyingLi,JunlongLiu,ChunhuiJiang ZexuanZhu LiangpengZhang and KeTang CollegeofComputerScienceandSoftwareEngineering SchoolofComputerScienceandTechnology ShenzhenUniversity UniversityofScienceandTechnologyofChina Shenzhen518060,China Hefei,Anhui230027,P.R.China Email:[email protected] Email:[email protected] 6 1 Abstract Hosang,Benenson,andSchiele2014).Evenworse,thereare 0 R-CNNstylemodelsarethestate-of-the-artobjectdetection not too many acceleration methods for them. These short- 2 methods, which consist of region proposal generation and ages make a trained R-CNN style model uneasy to be de- p deep CNN classification for objects. The proposal genera- ployedoncomputationallysensitiveplatforms. e tionphaseinthisparadigmisusuallytimeconsumingandnot Many approaches have been proposed to accelerate R- S convenienttobedeployedonhardwareswithlimitedcompu- CNN, but they are still not superior enough to be de- tationalability.Thisarticleshowsthatthehigh-levelpatterns ployed on low-end hardwares with respect to the real- 0 of feature value in deep convolutional feature maps contain time requirement. Some approaches focus on simplifying 2 plenty of useful spatial information, and propose a simple manuallydesignedapproachthatcanextracttheinformation the structure of CNN models (Han, Mao, and Dally 2016; ] for fast region proposal generation. The proposed method, Kim et al. 2016). Some focus on adapting the combina- V dubbedReliefR-CNN(R2-CNN),isanewdeeplearningap- tion of CNN and ROIs, such as Fast R-CNN (Girshick C proachforobjectdetection.Byextractingpositionsofobjects 2015) and Faster R-CNN (Ren et al. 2015). Others focus . from high-level convolutional patterns, R2-CNN generates on reducing the computation of ROI generation, such as s regionproposalsandperformsdeepclassificationsimultane- somewindowsbasedmethods(Hosangetal.2015).FastR- c ouslybasedonthesameforwardingCNNfeatures,unifiesthe CNN is an accelerated version of R-CNN. It reconstructs [ formerlyseparatedobjectdetectionprocessandspeedsupthe the combination of ROIs and CNN by directly mapping 3 wholepipelinewithoutadditionaltraining.Empiricalstudies the region proposals to a ROI layer inside the deep CNN v showthatR2-CNNcouldachievethebesttrade-offbetween model. Faster R-CNN integrates proposal generation pro- 9 speedanddetectionperformanceamongallthecomparison cessintotheFastR-CNNmodelbymeansofaregionpro- 1 algorithms. posal network(RPN). However, similar to other large scale 7 neuralnetworks,FasterR-CNNisprohibitiveforcomputa- 6 Introduction tion sensitive platforms. For ROI generation, compared to 0 . One type of the state-of-the-art deep learning methods for traditional grouping based methods (Uijlings et al. 2013; 1 object detection is R-CNN (Girshick et al. 2014) and its Hosang et al. 2015), some windows based methods are 0 derivativemodels(Girshick2015;Renetal.2015).R-CNN faster, such as Bing (Cheng et al. 2014) and Edgeboxes 6 consists of two separated procedures: reduced object pro- (Dolla´randZitnick2015).Allthesemethodscannotbeef- 1 posal generation and object classification. The object pro- ficientdeployonlow-endhardwares. : v posal generation focuses on finding the region of inter- In this paper, we propose Relief R-CNN(R2-CNN) to i ests(ROIs)(Renetal.2015)whichmaycontainobjects.The X speed up the deployment of ROI generation in any trained classification phase receives the generated ROIs for a deep R-CNNstylemodelswithoutanyextratraining.R2-CNNis r a CNN (Krizhevsky, Sutskever, and Hinton 2012) to classify inspired by the similarity between relief sculptures in real theseROIsasspecificobjectsorbackground. life and feature maps in CNN. Visualization of convolu- However,R-CNNmaysufferfromitslowefficiency,not tional layers(Zeiler and Fergus 2014; Simonyan, Vedaldi, only due to the inelegant combination of ROIs and CNN, and Zisserman 2013; A.Dosovitskiy, J.T.Springenberg, and but also the high computation cost of ROIs generation and T.Brox2015;MahendranandVedaldi2015)hasshownthat deepCNNprocess.Ontheonehand,commonlyuseddeep convolutional features with high values in a trained CNN CNN models are computationally intensive and only effi- directly map to the recognizable objects on input images. cient on special hardwares(e.g. GPU, TPU) (Krizhevsky, Therefore, R2-CNN utilizes these convolutional feature in- Sutskever,andHinton2012;SimonyanandZisserman2015; terrelationshipsforregionproposalgeneration.Itisdoneby He et al. 2015b; Ioffe and Szegedy 2015; He et al. 2015a; directly extracting the local region wrapping features with Szegedyetal.2015).Ontheotherhand,variousregionpro- high values as ROIs. This approach is faster than many posalmethodsarealsotimeconsuming(Hosangetal.2015; othermethods,sinceaconsiderablylargepartofitscompu- Copyright(cid:13)c 2017,AssociationfortheAdvancementofArtificial tations are comparison operations instead of time consum- Intelligence(www.aaai.org).Allrightsreserved. ingmultiplicationoperation(Renetal.2015;Jiaetal.2014; Recursive Fine-tuning(Proposals Refinement) predicted bounding box <x_min, y_min, x_max, y_max> conv_5+pool_5 conv_2+pool_2 conv_3 conv_4 object confidence Fast R-CNN conv_1+pool_1 ROI full connections Proposals Refinement Big box Integral Feature Map Small box feature value increase Feature levels Extracting Proposals from Feature Interrelationship Region Proposals Figure1:OverviewofReliefR-CNN.GeneratinganIntegralFeatureMapbasedonfeaturemapsinpool1layer,followedby separatingfeaturesofIntegralFeatureMapintodifferentFeatureLevels.Finally,extractingBigBoxesandSmallBoxesas regionproposalsandusingadditionalproposalrefinementtechniquesforbetterperformance.Theprocessconductedwithsolid linesistheprocedureofFastR-CNN,whiletheprocessalongwithdottedlinesistheoperationofR2-CNN Ghodratietal.2015;Dai,He,andSun2015).Furthermore, Grouping based methods, which are based on grouping R2-CNNusestheconvolutionalfeaturesproducedbyCNN pixeldetailsforregionproposals,areusuallytimeconsum- for ROI generation, while most of the methods need addi- ing in practice. Selective Search(SS) (Uijlings et al. 2013; tional feature extraction from raw images for ROIs. In a Van de Sande et al. 2011) is the state-of-the-art grouping word, R2-CNN could reduce much more computations in based method, which is highly used in various detection ROIgenerationphasecomparedtoothermethodsdiscussed models as baseline (Girshick et al. 2014; Girshick 2015; above. Hosangetal.2015).SSutilizesthesuper-pixelsforgroup- The basic structure of R2-CNN is shown in Figure 1. A ingbymanuallydesignedmergingprocess.Itmergespixel- trainedFastR-CNNisthebasemodelwhichneedsaccelera- detailsintovariantobjectproposals. tionindeployment.R2-CNNacceleratestheROIgeneration Windows scoring methods, which may need few data- roughlyby3steps. dependent training before testing, are usually faster than • Firstly, R2-CNN combines the feature maps in the first grouping based methods, since they can take advantage convolutionallayerasanIntegralFeatureMap,forthe of some data-dependent priors for locating objects. These purposeofdenoisingandsimplifyingfurtherprocessing. methodsobtainsomecoarsewindowcandidatesforobjects • Secondly,R2-CNNextractsthelocalregionswhichcon- atfirst.Thosecoarsewindowscanbepredefined(e.g.Bing) orquicklygenerated(e.g.EdgeBoxes).Then,thesemethods tain the features significantly higher than neighbor fea- use the high level features from images such as objectness turesasROIs. (Alexe, Deselaers, and Ferrari 2010) to judge whether an • Thirdly, the generated ROIs are transferred to the ROI imagewindowcontainsanobject.Objectness(Alexe,Dese- layerinFastR-CNNasinputs. laers, and Ferrari 2012; 2010), a well known method, gen- The main contributions of R2-CNN can be summarized eratesaninitialwindowsetfromthesalientlocationsinan into 2 points. The first is proposing a real-time region pro- image,thenexploitsimagecuestoscoretheinitialwindow posal generator with competitive qualities. The second is set. Bing (Cheng et al. 2014) presents a very fast approach revealing that, not only the convolutional features values byonlygeneratingproposalswithpredefinedwindowsize, butalsotheconvolutionalfeatureinterrelationshipsinCNN, andmeasuresobjectnesswithasimplelinearmodeltrained containprettymuchusefulinformation. over edges features. Edgeboxes (Zitnick and Dolla´r 2014; Dolla´r and Zitnick 2015; 2013) generates limited coarse RelatedWork windows,andthenevaluatingthewindowsbyobjectbound- Many studies have been devoted to the trade-off between aryestimationandsomefine-tuningtechniques. speed and accuracy of region proposal generation(Hosang Grouping based methods and Windows scoring methods et al. 2015; Rahtu, Kannala, and Blaschko 2011). Most of do not cause too much time budget in training. Faster R- them require prior knowledge to formulate rules for region CNN(Renetal.2015),onthecontrary,spendslotsoftime proposalsextraction.Twogeneralapproachesinthoseearly intrainingtoproduceadata-dependentregionproposalnet- studies are Grouping based methods and Windows scor- work(RPN) for quick and accurate testing. This technique ing methods (Hosang et al. 2015; Hosang, Benenson, and adopted by RPN is not practical for fast deployment on Schiele2014). computation limited equipments, since the speed of neural networkprocessingoncomputationlimitedplatformsisre- stricted. Faster R-CNN providesa more compact combina- tion of ROI generation and object classification, and casts object detection acceleration problem as deep neural net- work acceleration problem which is still not solved for re- sourcelimitedenvironments. TheproposedR2-CNNinthisarticle,forthepurposeof acceleratingtestingphasewithoutadditionaltrainingeffort, Figure 2: Different appearances of feature maps. Not all utilizesthenetwork-dependentpriortolocateobjectsintest- the feature maps focus on the objects of interest. Left is ing. It achieves a better trade-off between time and speed more interested in context information; Right does not fil- than previous approaches. Further more, it makes proposal ter all the background away; Middle is exactly what we generationnomoreobstacleforplatformswithlimitedcom- want. Although different feature maps have different inter- putationalability. ests, the boundary features in all these maps are described. Theboundaryfeaturesmaynotbeobviousinglobal,butsig- ReliefR-CNN nificantinlocalacrossdifferentmaps. In this section we present the details of R2-CNN. Figure 1 showsthebriefstructureofR2-CNN. 4 AugmentingROInumbersbyLocalSearch. GeneralIdea 5 PromotingtheperformancebyRecursiveFine-tuning. Observationsonconvolutionalfeaturemaps(ZeilerandFer- gus2014;MahendranandVedaldi2015;Gatys,Ecker,and Figure 3 shows a simple illustration of steps 1∼3. The fol- Bethge 2015) have shown that, the features related to rec- lowingsectionswilldescribethese5stepsindetail. ognizable objects have significant higher values than the InitialProposalSet nearby features from the context. Similar pattern can be seeninrelief:thebackgroundisfilteredaway.Thekeyparts Step1.IntegralFeatureMapGeneration of the vision are portrayed on the canvas. The objects are A synthetic feature map called Integral Feature Map distinguished by the boundaries which present significant is generated by adding all feature maps up to one map. It heightthannearbyelements.Theelementsinthesamekey wouldbeaverytimeconsumingtasktoprocessallmapsin part have similar heights. These similarities inspire us that thechosenlayer.Furthermore,notallthefeaturemapscon- therelativelocationinformationoffeaturevaluesinfeature taintheinformationaboutobjects.Somemapsaremorere- mapscanbetreatedasakindofedgeinformation,sotheob- latedtocontextinformation(seeFig.2),soitismeaningless jectscouldbecapturedfromtheconvolutionalfeaturemaps, to extract region proposals from such noisy feature maps. ratherthantherawimage.Therefore,onlyoncefeatureex- Therefore, the Integral Feature Map could be used for traction from image is necessary compared to most of the thepurposeofeliminatingnoisymapsandreducingtime. othermethodsneedtwiceforCNNprocessandROIgenera- The generation of Integral Feature Map consists tionseparately.Thiswilldramaticallyreducefeatureextrac- of two steps. Firstly, each feature map is normalized tiontime. by dividing by its maximal feature value. Secondly, an In this section, a manually designed proposal extraction IntegralFeatureMapisgeneratedbyaddingallthenor- method inspired by relief sculpture is proposed. The spe- malizedfeaturemapstogether. cificmethodadoptedinthisarticlehighlyrelyonthehyper Step2.SeparatingFeatureLevelsbyFeatureInterre- parameters related to Alexnet (Krizhevsky, Sutskever, and lationship Hinton2012)usedinexperiments,butthemainideacanbe FeaturesintheIntegralFeatureMaparegroupedinto applied to any CNN architectures which fit the observation several feature levels by uniformly separating their value discussedabove. range.AsdiscussedinGeneralIdea,edgedetailsofobjects The main idea of the proposed method is simple. By canbedescribedbytherelativerelationshipamongfeatures. searching the regions that are significant more salient than Inthiscontext,therelativerelationshipbetweentwofeatures nearby context features in convolutional feature maps, we indicates how large the gap between values of the two fea- can obtain some edge details of objects, and locate the ob- turesis.Thesesignificantgapswithhighmagnitudesinfea- jectsinthesourceimagebyutilizingthesedetails.R2-CNN turemapsaresomekindsofedgedetailswhichcharacterize canbesummarizedinto5stepsasfollow,step1∼3generate thecontoursofobjects. aninitialproposalset,step4∼5refinetheproposalset: However,there is not a precise threshold to determine whether gaps are “significantly” higher than others. R2- 1 Generating an Integral Feature Map from the first CNN deals with this problem by separating features into poolinglayerfordenoisingandfastprocessinginfollow- severalfeaturelevels,sothatthefeaturevaluesineachfea- ingsteps. turelevelaresignificanthigherorlowerthanthoseinother 2 Separating the features in the Integral Feature Map levels. It means that features in the same level are highly intodifferentFeatureLevels. possibletobepartofthesameobjectorsimilarobject.The 3 GroupingallthefeaturesinaFeatureLevelasdifferent features in the same feature level form a hierarchical sub ROIs. featuremap,asshowninFigure3. These feature levels in an Integral Feature Map Big Box f aregeneratedbydividingthevaluerangeofallthe integral featuresintoseveralsubranges.Eachsubrangeisaspecific levelwhichcontainspartoffeaturesinf .Thenum- integral Small Box berofsubrangeisahyperparameterl,R2-CNNuniformly dividesthef intolfeaturelevels,seeAlgorithm1. integral Algorithm1FeatureLevelSeparation CNN Forwarding Input: IntegralFeatureMapf integral Input: FeatureLevelNumberl conv_1+pool_1 Sample of Feature Levels 321::: FF(cid:3)iinnddGiinneggttittnhhgeetmmheianxviiammluaaellvrvaaanllugueeevovafalflueueaetmmuiranexviinanlfufieinnstteiegngrrfaalilntegral CNN Classifier Max value 4: (cid:3) Valuerangeis(valuemax−valuemin) 5: (cid:3) Uniformlydividingthevaluerangeintolsubranges separate value ranges 6: stride=(valuemax−valuemin)/l 7: (cid:3) featuresineachsubrangeformafeaturelevel 8: fori=1→ldo Integral Feature Map 9: Finding features bigger than valuemin +(i−1)∗ feature value increase Feature Levels 0 stride and smaller than value + i ∗ stride in min Figure3:FeaturelevelsseparatingandBig/SmallboxesEx- f asfeautre integral leveli tracting.Featurelevelsseparatingshowsthevaluedistribu- 10: featureleveliisthefeatureleveliforfintegral tion of features in a feature map, the same important parts 11: endfor of a feature map are grouped in the same hierarchical sub 12: return<featurelevel1,...,featurelevell > featuremap.OneBigBoxforeachsubfeaturemapcanbe takenasobjectcomposedofkeyparts.Therearealsomany Step3.BoundingBoxesGeneration Small Boxes in each sub feature map since small objects Bounding boxes for objects are generated by combining mayappearasfeatureclustersinthesubmap.Featurelevels the features in each feature level. The techniques for com- areuniformlyseparatedintotenlevelsinexperiment. bination are elaborating designed on the observations dis- cussedabove. Figure3showssomesamplesoffeaturelevelsgenerated ProposalRefinement from the first pooling layer of CaffeNet model(CaffeNet is a caffe implementation of AlexNet (Krizhevsky, Sutskever, R2-CNN provides a fast ROI generation for testing while andHinton2012)).Inthesesamples,eachbrightpixelinfea- thepre-traineddeepmodelisconvergedintrainingbyother turelevelsisafeature.Furthermore,thepositionconnected proposalmethod(e.g.SS).Itisobviousthattheaccuracyof features, which are surely belong to the same object, may testingisrestrictedbecauseofthedifferentproposalsdistri- be an important part of an object or even represent a small butionbetweentrainingandtesting.Owingtothisfact,some object.Therefore,proposalscanbegeneratedbyassembling fine-tuningmethodsshouldbeappliedtotheinitialproposal thesepositionconnectedfeatures,ordirectlyusingtheareas setforabetterdetectionrate. containedsuchfeaturesasproposals.Consideringthesetwo The purpose of R2-CNN is to accelerate testing phase approaches of proposal generation, two types of proposals withoutmoreresourceconsumingtothewholeR-CNNstyle aregeneratedforaspecificfeaturelevelfeature ,the model.Therefore,afine-tuningmethoddeployedintesting leveli processisshowninFigure3: phaseismorereasonablethanadeepmodelretrainedbythe windowcandidatesfromtheproposedmethod.Thefollow- • Big Box Proposal: This comes from the idea of assem- ing section introduces two proposal refinement techniques blingareasofpositionconnectedfeatures.Insteadofvar- fortheinitialproposalset. ious possibilities that combine areas into boxes, this ap- Step4.LocalSearch proachcombinesalltheareasofpositionlinkedfeatures Convolutional features from source image are not pro- into one Big Box for the feature . Such a method leveli duced by seamless sampling. As a result, bounding boxes can deal with the big object which cover a large propor- extractedinconvolutionalfeaturemapsarequitecoarseaf- tionoftargetimage. ter mapping to source images. Local Search in width and • SmallBoxProposals:Thiscomesfromtheideaofdirectly heightisappliedtotacklethisproblem.Foreachregionpro- usingtheareasofpositionconnectedfeaturesaspropos- posal,localsearchalgorithmneedstwoscaleratiosαandβ als.Firstly,itsearchesthefeatureclusters(namelythepo- togenerate4moreproposalsbyscalingitswidthandheight sitionconnectedfeatures)inthegivenfeatureleveli,and according to scale ratios. In experiments, α is fixed to 0.8 thenmappingthefeatureclustersasSmallBoxes. andβisfixedto1.5.TheLocalSearchcangiveabout1.8% AlltheBigBoxesandSmallBoxesconstitutetheinitialpro- mAPimprovementindetectionperformance. posalset. Step5.RecursiveFine-tuning Table1:TestingTimecomparison.TheobjectdetectionmodelusedhereisFastR-CNN.TheR2-CNNneedsrecursivefine- tuningwhichmakesclassificationbetime-consuming.“TotalTime”isthesumofvaluesin“ProposalTime”and“Classification Time”.“*”indicatestheruntimereportedin(Hosangetal.2015).NumberofrecursiveloopstoR2-CNNwassetto3.“RPN” istheproposalgenerationmodelusedinFasterR-CNN.Boldmeansthebesttimecost. Methods ProposalTime(sec.) Proposals ClassificationTime(sec.) TotalTime(sec.) R2-CNN 0.00048 760.19 0.146 0.14648 Bing 0.2* 2000 0.115 0.315 EdgeBoxes 0.3* 2000 0.115 0.415 RPN 1.445 2000 0.115 1.560 Objectness 3* 2000 0.115 3.115 SelectiveSearch 10* 2000 0.115 10.115 Acommonlyusedboundingboxrefinementtechniquefor thesame.Theevaluationcodeusedforgeneratingfigure4,5 R-CNN style models is the bounding box regressor. The wasalsopublishedby(Hosangetal.2015). trained regressor predicts the corrections for the input pro- In experiments, the Fast R-CNN (Girshick 2015) was posalssothatitwillfittheobjectsmuchbetter.Theprocess based on CaffeNet which was a Caffe (Jia et al. 2014) ver- of a trained regressor in deep model can be formalized as sionofAlexNet(Krizhevsky,Sutskever,andHinton2012). anoptimizationprocessROI = Regressor(ROI ).In TheFastR-CNNmodelwastrainedwithSSjustthesameas out in thisequation,ROI istheinputproposals,andROI is in(Girshick2015).TheFasterR-CNN(Renetal.2015)used in out the refined proposals which have a better overlapping rate in experiments was based on project py-faster-rcnn (rbgir- overgroundtruthboundingboxes. shick2016),whichwasapythonversionofFasterR-CNN. Fast R-CNN directly uses the ROI as the predicted Despite the difficulty of Faster R-CNN for low power out objectboundingbox.Thisisfinewhenproposalsintraining devices, RPN of Faster R-CNN is still one of the state-of- and testing are generated by the same method. However, it the-artproposalmethods.Therefore,RPNwasstilladopted isnotastableoptimizationprocessfromthecontrolsystem in experiments using the same Fast R-CNN model consis- pointofview,becauseitdoesnotprovideanyfeedbacksto tent with other methods for detection. The RPN in experi- verify whether the result is optimal or not, so there is no mentswastrainedonthefirststageofFasterR-CNNtraining guarantee of the performance in testing if the method for phases.ThisparadigmistheunsharedFasterR-CNNmodel proposalsgenerationaredifferentfromtraining.Therefore, mentionedin(Renetal.2015).In(Renetal.2015),there- wedevelopedaclose-loopboundingboxregressor,namely gionproposalsneedNMS(non-maximumsuppression)after the Recursive Fine-tuning step. This step links the ROI generation,whichisnotapartofRPNmodelitselfandre- out as the input of Regressor(ROI ) again, and only stops quiresmoregenerationtime,hencetheproposalsusedhere in whendetectionperformanceofROI converges. aretherawproposalswithoutNMS. out Therecursivefine-tuningisaverysimplestep.Itdoesnot Itshouldbenoticedthatalltheproposalgenerationparts needanychangestoexistingR-CNNstylemodels,butjusta wereprocessedinCPU(includingR2-CNNandRPN)while recurrentlinkfromtheoutputofatrainedboxregressorback thedeepneuralnetworksofclassificationswereprocessedin to its input. Briefly speaking, it is a trained box regressor GPU.AllthedeepneuralnetworkshadrunononeNVIDIA wrappedupintoaclosed-loopsystemfromaR-CNNstyle TitanXGPUcard,andtheCPUusedintheexperimentswas model. IntelE5-2650V2with8cores,2.6Ghz. SpeedPerformance Experiments Table1containstheresultsofcomparisonabouttimeintest- Setup ing. The testing time is separated into proposal time and Inthissection,wecomparedourR2-CNNwithsomestate- classification time. The proposal time is the time cost for proposal generation, and the classification time is the time of-the-art object detection methods. All experiments were costforverifyingalltheproposals. tested on PASCAL VOC 2007 (Everingham et al. 2014) basedonFastR-CNNmodel. It should be noticed that the time costs of convolutional For the R2-CNN in the experiments, the number of re- layersareincludedinclassificationtime,sothattheproposal generationtimeforR2-CNNintable1isonlytheproposal cursive loops was set to 3, and the number of feature lev- calculation time based on the convolutional features as in- elswas10.Foreasyimplementation,weusedanindividual puts.Thisisbecauseofthesharedfeaturenatureofproposal CaffeNet to do proposals extraction, not exactly as in Fig generationandclassificationinR2-CNN. 1 where the CaffeNet for feature extraction was combined insidetheFastR-CNN. DetectionPerformance TheproposalsofBing,Objectness,EdgeBoxesandSelec- tive search were the pre-generated proposals published by Table 2 has shown the detection performances of R2-CNN (Hosang et al. 2015), since the the algorithm settings were and other comparison methods. Precision (O¨zdemir et al. 2010) is a well known metric to evaluate the precision of 1 1 0.9 0.9 predictions, mAP(mean Average Precision) is a highly ac- 0.8 0.8 ceFTetaapbsattlle.eRd2-20eC:1vN4aDl)Nue.atAdeticelolttientochnitenioprtnehesermufoolotrbsdmjeienlacnbttcaadebseelteodefc2otiRnwoCn2e-ratCeaffNseekvNNa(eRluCt.uaostemsadpkbaoyrvestkhtoye recall at IoU threshold 0.50000000......234567 recall at IoU threshold 0.80000000......234567 0.1 0.1 Other Methods. R2-CNN shows comparable detection per- 0 0 formance.BoldresultsaretheresultsofR2-CNN 101 # proposal1s02 101 # proposal1s02 (a)IoU0.5 (c)IoU0.8 1 1 Methods mAP meanPrecision 0.9 0.9 BEidnggeBoxes BORRESediPb2nlgjNe-geecCcBttNinvoNeexseSssearch 445555140357......249850 218945.....72229 recall at IoU threshold 0.700000000.......2345678 average recall0000000.......2345678 OORSeabulhjree tMccutteinvteehsSosdearch 0.1 0.1 0 0 101 102 101 102 # proposals # proposals The empirical results in table 1 and 2 reveal that R2- (b)IoU0.7 (d)AverageRecall CNN outperforms other methods considering the trade-off between time and detection performance. R2-CNN could Figure4:RecalltoProposalscurveonVOC07 achieves a very competitive detection performance com- pared to state-of-the-art SS and EdgeBoxes with a much 1 morefastspeed. 0.9 BEidnggeBoxes Objectness Proposalquality 0.8 Our Method Rahtu 0.7 SelectiveSearch To evaluate the quality of proposals, two commonly used 0.6 evaluationmetrics(Hosangetal.2015)areadopted. ecall0.5 1 One is the curves of Recall-to-Proposals under different r0.4 IoU thresholds. Recall (O¨zdemir et al. 2010) is a well 0.3 knownevaluationtoanalyzehowmanygroundtruthob- 0.2 jectsarefound. 0.1 2 TheotherevaluationmetricisRecall-to-IoUcurve. 0 0.5 0.6 0.7 0.8 0.9 1 IoU overlap threshold The metric IoU means intersection over union (Rus- sakovskyetal.2014),itisanevaluationcriteriontomeasure Figure 5: Recall to IoU threshold with 200 proposals in howsimilartwoboxesare. count.R2-CNNhadnearlydominatedothermethods. Figure4containsRecall-to-Proposalscurves.Itcouldbe found that R2-CNN was more stable under different IoU thresholdscomparedtootherwindowsbasedmethods.Bing inthissectionhaveshownthatR2-CNNcangetaverygood andObjectnessonlyperformedwellundertheIoU thresh- performanceinlimitproposalssituationwithahighspeed. old0.50.EdgeBoxes,inIoU threshold0.70,alsoperformed adifferenttrendcomparedtothreshold0.5and0.8.R2-CNN Conclusion performedmorestablethanthesewindowsbased methods, the differences among R2-CNN and other methods were This paper presents a unified object detection model called continuallydecreasingasIoU thresholdbecamelarger.For Relief R-CNN(R2-CNN), which is based on the similar- SS,itperformedasstableasR2-CNN,butR2-CNNalways ity between convolutional feature maps and sculpted re- gotbetterrecallwhenproposalsnumberwaslessthanabout liefs. By directly extracting region proposals from convo- 200 proposals in all the IoU thresholds we evaluated. In lutionalfeatureinterrelationship,namelythelocationinfor- summary,ontheaveragerecallcurvewecouldfindthat,R2- mationofsalientfeaturesinlocalregion,R2-CNNreduces CNNperformedbetterthanallothermethods. the ROI generation time in deployment of R-CNN style Figure 5 is Recall-to-IoU curves, it could be found that models. Empirical studies demonstrated that R2-CNN was R2-CNN had nearly dominated other methods in IoU faster than previous works with competitive detection per- thresholdbetween0.5∼0.9,andbecamethesecondarybest formance.Moreover,noadditionaltrainingbudgettoorigi- inIoU threshold0.9∼1.0. nalFastR-CNNbaselinemodelwasneeded.Theresultsof It should be noticed that R2-CNN can not control the experiments also revealed a new thinking that information numberofproposals,butitgetsthebestresultwithhundreds isnotonlypresentedasfeaturesbutalsotheorganizationof ofproposalswhileothersneedthousands.Theexperiments featurepositionsinadeepCNN. References Kim,Y.-D.;Park,E.;Yoo,S.;Choi,T.;Yang,L.;andShin, D. 2016. Compression of deep convolutional neural net- A.Dosovitskiy;J.T.Springenberg;andT.Brox. 2015. Learn- worksforfastandlowpowermobileapplications. ICLR. ing to generate chairs with convolutional neural networks. InCVPR. IEEE. Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural net- Alexe, B.; Deselaers, T.; and Ferrari, V. 2010. What is an works. InNIPS,1097–1105. object? InCVPR,73–80. IEEE. Mahendran,A.,andVedaldi,A. 2015. Understandingdeep Alexe, B.; Deselaers, T.; and Ferrari, V. 2012. Measuring imagerepresentationsbyinvertingthem. CVPR. theobjectnessofimagewindows. PAMI34(11):2189–2202. O¨zdemir, B.; Aksoy, S.; Eckert, S.; Pesaresi, M.; and Cheng, M.-M.; Zhang, Z.; Lin, W.-Y.; and Torr, P. 2014. Ehrlich,D.2010.Performancemeasuresforobjectdetection Bing:Binarizednormedgradientsforobjectnessestimation evaluation. PatternRecognitionLetters31(10):1128–1137. at300fps. InCVPR,3286–3293. IEEE. Rahtu, E.; Kannala, J.; and Blaschko, M. 2011. Learning Dai, J.; He, K.; and Sun, J. 2015. Convolutional feature a category independent object detection cascade. In ICCV, maskingforjointobjectandstuffsegmentation. CVPR. 1052–1059. IEEE. Dolla´r,P.,andZitnick,C.L.2013.Structuredforestsforfast rbgirshick. 2016. py-faster-rcnn project. https:// edge detection. In Proceedings of the IEEE International github.com/rbgirshick/py-faster-rcnn. ConferenceonComputerVision,1841–1848. Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r- Dolla´r,P.,andZitnick,C.L.2015.Fastedgedetectionusing cnn:Towardsreal-timeobjectdetectionwithregionproposal structured forests. IEEE transactions on pattern analysis networks. InNIPS,91–99. andmachineintelligence37(8):1558–1570. Russakovsky,O.;Deng,J.;Su,H.;Krause,J.;Satheesh,S.; Everingham, M.; Eslami, S. A.; Van Gool, L.; Williams, Ma,S.;Huang,Z.;Karpathy,A.;Khosla,A.;Bernstein,M.; C.K.;Winn,J.;andZisserman,A. 2014. Thepascalvisual et al. 2014. Imagenet large scale visual recognition chal- objectclasseschallenge:Aretrospective. IJCV111(1):98– lenge. IJCV1–42. 136. Sermanet,P.;Eigen,D.;Zhang,X.;Mathieu,M.;Fergus,R.; Gatys,L.A.;Ecker,A.S.;andBethge,M. 2015. Aneural and LeCun, Y. 2013. Overfeat: Integrated recognition, lo- algorithmofartisticstyle.arXivpreprintarXiv:1508.06576. calizationanddetectionusingconvolutionalnetworks.arXiv preprintarXiv:1312.6229. Ghodrati, A.; Diba, A.; Pedersoli, M.; Tuytelaars, T.; and VanGool,L. 2015. Deepproposal:Huntingobjectsbycas- Simonyan,K.,andZisserman,A. 2015. Verydeepconvolu- cadingdeepconvolutionallayers. InICCV,2578–2586. tionalnetworksforlarge-scaleimagerecognition. ICLR. Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Simonyan,K.;Vedaldi,A.;andZisserman,A. 2013. Deep Richfeaturehierarchiesforaccurateobjectdetectionandse- insideconvolutionalnetworks:Visualisingimageclassifica- manticsegmentation. InCVPR,580–587. IEEE. tionmodelsandsaliencymaps. ICLRWorkshop2014. Girshick,R. 2015. Fastr-cnn. InICCV. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, Han, S.; Mao, H.; and Dally, W. J. 2016. Deep com- A. 2015. Goingdeeperwithconvolutions. pression: Compressing deep neural network with pruning, Uijlings,J.R.;vandeSande,K.E.;Gevers,T.;andSmeul- trainedquantizationandhuffmancoding. ICLR2. ders, A. W. 2013. Selective search for object recognition. He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015a. Deep IJCV104(2):154–171. residual learning for image recognition. arXiv preprint VandeSande,K.E.;Uijlings,J.R.;Gevers,T.;andSmeul- arXiv:1512.03385. ders,A.W. 2011. Segmentationasselectivesearchforob- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015b. Delving jectrecognition. InICCV,1879–1886. IEEE. deepintorectifiers:Surpassinghuman-levelperformanceon Zeiler,M.D.,andFergus,R. 2014. Visualizingandunder- imagenetclassification. standingconvolutionalnetworks. InECCV.Springer. 818– Hosang,J.;Benenson,R.;andSchiele,B. 2014. Howgood 833. aredetectionproposals,really? InBMVC. Zitnick, C. L., and Dolla´r, P. 2014. Edge boxes: Locating Hosang,J.;Benenson,R.;Dolla´r,P.;andSchiele,B. 2015. objectproposalsfromedges. InECCV.Springer. 391–405. Whatmakesforeffectivedetectionproposals? PAMI. Ioffe,S.,andSzegedy,C.2015.Batchnormalization:Accel- eratingdeepnetworktrainingbyreducinginternalcovariate shift. InBlei,D.,andBach,F.,eds.,ICML,448–456. JMLR WorkshopandConferenceProceedings. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick,R.;Guadarrama,S.;andDarrell,T. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACMMultimedia,675–678. ACM.