Published in Proceedings of IEEE International Conference on Intelligent Robots and Systems (IROS), 2015. © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

On the Performance of ConvNet Features for Place Recognition

Niko Sünderhauf, Sareh Shirazi, Feras Dayoub, Ben Upcroft, and Michael Milford

Abstract—After the incredible success of deep learning in the computer vision domain, there has been much interest in applying Convolutional Network (ConvNet) features in robotic fields such as visual navigation and SLAM. Unfortunately, there are fundamental differences and challenges involved. Computer vision datasets are very different in character to robotic camera data, real-time performance is essential, and performance priorities can be different. This paper comprehensively evaluates and compares the utility of three state-of-the-art ConvNets on the problems of particular relevance to navigation for robots: viewpoint-invariance and condition-invariance, and for the first time enables real-time place recognition performance using ConvNets with large maps by integrating a variety of existing (locality-sensitive hashing) and novel (semantic search space partitioning) optimization techniques. We present extensive experiments on four real world datasets cultivated to evaluate each of the specific challenges in place recognition. The results demonstrate that speed-ups of two orders of magnitude can be achieved with minimal accuracy degradation, enabling real-time performance. We confirm that networks trained for semantic place categorization also perform better at (specific) place recognition when faced with severe appearance changes and provide a reference for which networks and layers are optimal for different aspects of the place recognition problem.

Fig. 1: Convolutional neural networks can extract features that serve as holistic image descriptors for place recognition. We found the features from layer conv3 to be robust against appearance changes on a variety of datasets. The figure shows an example scene and extracted features from different network layers (pool1, pool2, conv3, conv4, conv5, fc6) of the AlexNet network.

I. INTRODUCTION

Robots that aim at autonomous long-term operations over extended periods of time, such as days, weeks, or months, are faced with environments that can undergo dramatic changes in their visual appearance over time. Visual place recognition – the ability to recognize a known place in the environment using vision as the main sensor modality – is largely affected by these appearance changes and is therefore an active research field within the robotics community.
The recent literature proposes a variety of approaches to address the challenges of this field [1]–[14]. Recent progress in the computer vision and machine learning community has shown that the features generated by Convolutional Networks (ConvNets, see Fig. 1 for examples) outperform other methods in a broad variety of visual recognition, classification and detection tasks [15]. ConvNets have been demonstrated to be versatile and transferable, i.e. even though they were trained on a very specific target task, they can be successfully deployed for solving different problems and often even outperform traditional hand-engineered features [16]. Our paper leverages these astounding properties and introduces the first real-time capable ConvNet-based place recognition system. We exploit the hierarchical nature of ConvNet features and use the semantic information encoded in the higher layers for search space partitioning and the mid-level features for place matching under challenging conditions. Locality-sensitive hashing of these features allows us to perform robust place recognition against 100,000 known places at 3 Hz. We provide a thorough investigation of the utility of the individual layers in the ConvNet hierarchy under severe appearance and viewpoint variations and furthermore compare three state-of-the-art networks for the task of place recognition. We establish the following main results:

1) features from the higher layers of the ConvNet hierarchy encode semantic information about a place and can be exploited to significantly reduce place recognition time by partitioning the search space;
2) a speed-up of two orders of magnitude can be achieved by approximating the cosine distance between features with the Hamming distance over bit vectors obtained by Locality Sensitive Hashing, compressing the feature data by 99.6% but retaining 95% of place recognition performance;
3) when comparing different ConvNets for the task of place recognition under severe appearance changes, networks trained for the task of semantic place categorization [17] outperform the ConvNet trained for object recognition;
4) features from the middle layers in the ConvNet hierarchy exhibit robustness against appearance changes induced by the time of day, seasons, or weather conditions; and
5) features extracted from the top layers are more robust with respect to viewpoint changes.

The authors are with the ARC Centre of Excellence for Robotic Vision, Queensland University of Technology (QUT). http://www.roboticvision.org/ email: [email protected]

In the following we review the related literature before describing the datasets and evaluation protocol used. We analyze the performance of individual ConvNet feature layers for place recognition on several challenging datasets in Section IV and introduce important algorithmic performance improvements in Section V before concluding the paper.

II. RELATED WORK

A. Place Recognition

The focus of research in place recognition has recently moved from recognizing scenes without significant appearance changes [18]–[20] to more demanding, but also more realistic, changing environments. Methods that address the place recognition problem span from matching sequences of images [1], [4]–[7], transforming images to become invariant against common scene changes such as shadows [2], [8], [10]–[12], learning how environments change over time and predicting these changes in image space [3], [12], [21], particle filter-based approaches that build up place recognition hypotheses over time [13], [14], [22], to building a map of experiences that covers the different appearances of a place over time [9]. Learning how the appearance of the environment changes generally requires training data with known frame correspondences. [4] builds a database of observed features over the course of a day and night. [3], [21] present an approach that learns systematic scene changes in order to improve performance on a seasonal change dataset. [23] learns salient regions in images of the same place under different environmental conditions. Beyond the limitation of requiring training data, the generality of these methods is also currently unknown; they have only been demonstrated to work in the same environment and on the same or very similar types of environmental change to that encountered in the training datasets.
B. Convolutional Networks

A commonality between all the aforementioned approaches is that they rely on a fixed set of hand-crafted traditional features or operate on the raw pixel level. However, a recent trend in computer vision, and especially in the field of object recognition and detection [15], [24], is to exploit learned features using deep convolutional networks (ConvNets). It therefore appears very promising to analyze these features and experimentally investigate their feasibility for the task of place recognition. The convolutional network is a well-known architecture, proposed by LeCun et al. in 1989 [25] to recognize hand-written digits. Several research groups have recently shown that ConvNets outperform classical approaches for object classification or detection that are based on hand-crafted features [15], [16], [26]–[28]. The availability of pre-trained network models makes it easy to experiment with such approaches for different tasks: the software packages Overfeat [27] and Caffe [29] provide network architectures pre-trained for a variety of recognition tasks. [30] was the first to consider ConvNets for place recognition. Our investigation is more thorough, since we cleanly separate the performance contribution of the ConvNet features from the matching strategy, conduct more systematic experiments on various datasets, compare three different ConvNets, and contribute important algorithmic improvements which enable real-time performance.

III. PRELIMINARIES

A. Feature Extraction using a Convolutional Neural Network

For the experiments described in the following section, we deploy the AlexNet ConvNet [26] provided by Caffe [29]. This network was pre-trained on the ImageNet ILSVRC dataset [24] for object recognition. It consists of five convolutional layers followed by three fully connected layers and a soft-max layer. The output of each individual layer can be extracted from the network and used as a holistic image descriptor. Since the third fully connected layer and the soft-max layer are adapted specifically to the ILSVRC task (they have 1000 output neurons for the 1000 object classes in ILSVRC), we do not use them in the following experiments. Table I lists the used layers and compares their dimensionality; Fig. 1 displays some exemplar features extracted from different layers. All images are resized to 231×231 pixels to fit the expected input size of the ConvNet.
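The paper does not include code for this extraction step; the following is a minimal sketch using Caffe's Python interface, assuming a standard AlexNet deploy/weights pair (the file names here are placeholders):

```python
import caffe

# Placeholder file names; any AlexNet deploy/weights pair shipped with Caffe works.
net = caffe.Net('alexnet_deploy.prototxt', 'alexnet.caffemodel', caffe.TEST)

# Standard Caffe preprocessing: HWC -> CHW, RGB -> BGR, [0,1] -> [0,255].
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_raw_scale('data', 255.0)

image = caffe.io.load_image('scene.jpg')   # resized to the network input by preprocess()
net.blobs['data'].data[...] = transformer.preprocess('data', image)
net.forward()

# The raw layer activations serve directly as a holistic image descriptor.
feature = net.blobs['conv3'].data[0].flatten()   # 384 x 13 x 13 = 64,896 dimensions
```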
TABLE I: The layers from the AlexNet ConvNet used in our evaluation and their output dimensionality.

    Layer   Dimensions       Layer   Dimensions
    pool1   96 × 27 × 27     conv5   256 × 13 × 13
    pool2   256 × 13 × 13    fc6     4096 × 1 × 1
    conv3   384 × 13 × 13    fc7     4096 × 1 × 1
    conv4   384 × 13 × 13

B. Image Matching and Performance Measures

Place recognition is performed by single-image nearest neighbor search based on the cosine distance of the extracted feature vectors. We analyze the performance in terms of precision-recall curves and F1 scores.¹ In contrast to [30] we explicitly do not apply sequence search techniques or other specialized algorithmic approaches to improve the matching performance, in order to observe the baseline performance of the ConvNet features under investigation.

¹ A match is considered a positive if it passes a ratio test (ratio of the distances of the best over the second best match found in the nearest neighbor search), and a negative otherwise. Since every scene in the datasets has a ground truth match (i.e. there are no true negatives), every negative is a false negative. A match is a true positive when it is within ±1..5 frames of the ground truth (depending on the frame rate of the recorded dataset) and a false positive otherwise. The running parameter for creating the PR curves is the threshold on the ratio test.
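The paper's Numpy/Scipy implementation of this matching scheme is not published; the following is a minimal NumPy sketch of the described single-image nearest neighbor search with the ratio test. The function name and the default ratio value are illustrative; sweeping the ratio threshold is exactly what produces the PR curves.

```python
import numpy as np

def match_place(query, database, ratio=0.9):
    """Single-image nearest neighbor search over cosine distances.

    query:    (d,)  descriptor of the current image
    database: (n, d) descriptors of all known places
    Returns the index of the best match, or None if the ratio test rejects it.
    """
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    dist = 1.0 - db @ q                    # cosine distance to every known place
    best, second = np.argsort(dist)[:2]
    # Accept only if the best match clearly beats the runner-up (ratio test).
    if dist[best] < ratio * dist[second]:
        return int(best)
    return None
```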
C. Datasets used in the Evaluation

We used four different datasets in the experiments presented in Section IV. From the summary of their characteristics in Table II we can see that we carefully selected these datasets to form three groups of varying condition changes: two datasets exhibit severe appearance change but virtually no variation in viewpoint, one shows viewpoint changes but only mild appearance changes, and three others feature both types of variations.

TABLE II: The datasets used in the evaluation form three groups of varying severity of appearance and viewpoint changes.

                                                    Variations in
    Dataset                       Environment       Appearance  Viewpoint
    Nordland seasons              train journey     severe      none
    GP day-right – night-right    campus outdoor    severe      none
    GP day-left – night-right     campus outdoor    severe      medium
    Campus Human – Robot          campus mixed      medium      medium
    St. Lucia                     suburban roads    medium      medium
    GP day-left – day-right       campus outdoor    minor       medium

1) The Nordland Dataset: The Nordland dataset consists of video footage of a 10-hour long train journey recorded from the perspective of the front cart in four different seasons. Fig. 6 gives an impression of the severe appearance changes between the different seasons. The Nordland dataset is a perfect experimentation dataset since it exhibits no viewpoint variations whatsoever and therefore allows testing algorithms on pure condition changes. See [3] for a more elaborate discussion of this dataset. For our experiments we extracted image frames at a rate of 1 fps and exclude all images that are taken inside tunnels or when the train was stopped. We furthermore exclude images between the timestamps 1:00h-1:30h and 2:47h-3:30h since these form a training dataset we use in parallel work.

2) The Gardens Point Dataset: The Gardens Point dataset has been recorded on the Gardens Point Campus of QUT in Brisbane. It consists of three traverses of the environment, two during the day and one during the night. The night images have been extremely contrast enhanced and converted to grayscale in the process. One of the day traverses was recorded keeping to the left side of the walkways, while the other day and the night datasets have been recorded from the right side. This way, the dataset exhibits both appearance and viewpoint changes.

3) The Campus Human vs. Robot Dataset: This dataset was recorded in different areas of our campus (outdoor, office, corridor, food court) once by a robot using the Kinect RGB camera and once by a human with a GoPro camera. While the robot footage was recorded during the day, the human traversed the environment during dawn, resulting in significant appearance changes especially in the outdoor scenes. The GoPro images were cropped to contain only the center part of half of the size of the original images.

4) The St. Lucia Dataset: The St. Lucia dataset [31] has been recorded from inside a car moving through a suburb in Brisbane at 5 different times during a day, and also on different days over a time of two weeks. It features mild viewpoint variations due to slight changes in the exact route taken by the car. More significant appearance changes due to the different times of day can be observed, as well as some changes due to dynamic objects such as traffic or cars parked on the street.

Fig. 2: Top: Two example images from the St. Lucia dataset showing the same place. Bottom: Precision-recall curves averaged over all nine trials (recorded at different times during the day, and several days apart). Layers conv3 and conv4 perform almost perfectly (overlapping curves in the upper right corner).

Fig. 3: Extreme appearance changes between day and night images (Gardens Point, day-right vs. night-right). The nighttime images have been contrast enhanced and converted to grayscale in the process. Despite these changes, layer conv3 still performs reasonably well.

IV. LAYER-BY-LAYER STUDIES

This section provides a thorough investigation of the utility of different layers in the ConvNet hierarchy for place recognition and evaluates their individual robustness against the two main challenges in visual place recognition: severe appearance changes and viewpoint variations.

A. Appearance Change Robustness

In a first set of experiments we analyze the robustness of the features from different layers in the ConvNet architecture against appearance changes. We conduct these experiments on the following datasets:

1) the Nordland dataset, using 5 different season pairings that involve the spring or winter season
2) the Gardens Point dataset, using all 3 subsets, spanning day and night
3) the St. Lucia dataset, using a total of 9 traversals from varying daytimes, over the course of several weeks
4) the Campus dataset, using footage recorded by a human and a robot at different times of the day, including dawn

Figures 2-6 show the resulting precision-recall curves for all experiments. Table III summarizes the results further and compares the F1 scores with other state-of-the-art methods. The experiments consistently show that the mid-level features from layer conv3 are more robust against appearance changes than features from any other layer. Both the lower layers (pool1, pool2) and the higher layers (e.g. fc6 and fc7) in the feature hierarchy lack robustness and exhibit inferior place recognition performance. We further compared our approach to SeqSLAM [1], a state-of-the-art method for place recognition under extreme appearance changes, and found that single image matching using features extracted by layer conv3 matches or exceeds SeqSLAM's performance. Previous work [3], [32] established that FAB-MAP [18], [19] is not capable of handling the severe appearance changes of the Nordland and St. Lucia datasets.

Fig. 4: When combining extreme appearance changes with viewpoint changes (Gardens Point, day-left vs. night-right), place recognition performance begins to deteriorate. This combination is still very challenging for current place recognition systems.

Fig. 5: The Campus dataset was recorded with different cameras by a robot during the day and a human at dawn. The mid-level convolutional layers outperform the fully connected layers fc6 and fc7.

Fig. 6: Place recognition across seasons on the Nordland dataset (spring vs. winter, summer vs. winter, and spring vs. summer). Despite the extreme appearance changes, conv3 performs acceptably, given that the image matching is based on a single frame nearest neighbor search only. Notice how fc6 and fc7 fail completely.

TABLE III: Comparison of F1-Scores between different layers and other state-of-the-art methods. † We established in [3] that FAB-MAP fails on the Nordland dataset. The maximum measured recall was 0.025 at 0.08 precision. ‡ [32] reported that FAB-MAP is not able to address the challenges of the St. Lucia dataset, creating too many false positive matches. * For SeqSLAM on the Nordland dataset we skipped every second frame (due to performance reasons) and used a sequence length of 5, resulting in an effective sequence length of 10. All other trials with SeqSLAM used a sequence length of 10.

    Dataset                     pool1  pool2  conv3  conv4  conv5  fc6   fc7   SeqSLAM  FAB-MAP
    Nordland
      Spring – Winter           0.54   0.63   0.75   0.70   0.50   0.09  0.06  0.80*    †
      Summer – Winter           0.49   0.51   0.61   0.55   0.34   0.07  0.04  0.64*    †
      Fall – Winter             0.52   0.60   0.66   0.60   0.38   0.07  0.05  0.63*    †
      Spring – Summer           0.75   0.89   0.93   0.91   0.83   0.49  0.32  0.86*    †
      Spring – Fall             0.78   0.90   0.93   0.92   0.84   0.53  0.35  0.88*    †
    Gardens Point
      day left – day right      0.61   0.88   0.89   0.88   0.91   0.96  0.96  0.44     —
      day right – night right   0.66   0.89   0.94   0.93   0.92   0.82  0.73  0.44     —
      day left – night right    0.32   0.64   0.68   0.66   0.65   0.65  0.56  0.21     —
    St. Lucia
      average over 9 trials     —      —      0.98   0.98   0.95   0.81  0.72  —        ‡
      worst trial               —      —      0.96   0.96   0.87   0.51  0.38  0.59     ‡
    Campus
      human – robot             —      —      0.89   0.89   0.89   0.80  0.74  0.22     —

B. Viewpoint Change Robustness

Viewpoint changes are a second major challenge for visual place recognition systems. While point feature-based methods like FAB-MAP are less affected, holistic methods such as the one proposed here are prone to error in the presence of viewpoint changes. In order to quantify the viewpoint robustness of ConvNet features, we perform two experiments:

1) We use the Nordland spring dataset to create synthetic viewpoint changes by using shifted image crops.
2) We use the Gardens Point dataset (day left vs. day right) to verify the observed effects on a real dataset.

We conduct the first experiment on the Nordland spring dataset using images cropped to half of the width of the original images. We simulate viewpoint changes between two traverses by shifting the images of the second traverse to the right, as sketched below. This results in overlaps between the images of the simulated first and second traverse of between 98% and 20%. For these experiments, the first 5000 images from the spring dataset (excluding tunnels, stoppages and the training dataset mentioned before) were used. Fig. 7 illustrates the results of the experiment with F1 scores extracted from precision recall statistics.
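To make the construction concrete, the following is an illustrative NumPy sketch under our reading of the setup (the exact cropping geometry used by the authors is not specified; the function name is ours):

```python
import numpy as np

def shifted_crop_pair(image, overlap):
    """Synthetic viewpoint change via shifted half-width crops (sketch).

    Both simulated 'traverses' see a crop of half the original width; the
    second crop is shifted to the right so that the two crops share
    `overlap` (in [0, 1]) of their width, e.g. overlap=0.9 for a mild
    viewpoint change.
    """
    width = image.shape[1]
    crop_w = width // 2
    shift = int(round((1.0 - overlap) * crop_w))
    first = image[:, :crop_w]                    # anchored at the left
    second = image[:, shift:shift + crop_w]      # shifted to the right
    return first, second
```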
Fig. 7: Top row: Examples for the synthetic viewpoint variation experiments conducted on the Nordland train dataset. We cropped the original images to half of their width and then created shifted versions with varying overlap. Bottom: F1 scores for different layers and overlap values. fc6 performs best, but all layers are invariant to small variations that maintain ≈90% overlap.

In a second experiment we used the Gardens Point day-left vs. day-right dataset that exhibits a lateral camera movement of 2-3 meters for medium viewpoint changes between two traversals of the environment. Only minor variations in the appearance of the scenes can be observed, mostly caused by people walking on the campus. Fig. 8 shows the resulting precision recall curves.

Fig. 8: Testing the viewpoint invariance on a realistic dataset. Two traverses through the same environment were recorded, one on the left and another on the right side of the walkway. The top two images compare the difference in viewpoint from the same place in both sequences. The plot shows that the higher layers fc6 and fc7 perform best.

Both experiments show that features from layers higher in the ConvNet architecture, especially fc6, are more robust to viewpoint changes than features from lower layers. The increased viewpoint robustness of the upper layers can be attributed to the pooling layers that are part of the network architecture and perform max-pooling after the first, second, and fifth convolutional layer. The synthetic experiments summarized in Fig. 7 also show that all layers are robust to mild viewpoint changes with more than 90% overlap between scenes. This effect is due to the resampling (resizing) of images before they are passed through the network and the convolution and pooling operations that occur in the first layer. In subsequent work [33] we addressed the scenario of viewpoint and appearance changes occurring simultaneously using a number of new datasets.

C. Summary and Discussion

Table IV summarizes our experiments. It is apparent that place recognition based on single image matching performs well even under severe appearance changes when using the mid-level conv3 layer as a holistic feature extractor. Features from the higher layers, especially fc6, only outperform conv3 in situations with viewpoint changes but no or only mild appearance variations. This intuitively makes sense, since the features from the first convolutional layers resemble very simple shape features [34] that are not discriminative and generic enough to allow place recognition under severe appearance changes. The layers higher in the hierarchy, and especially the fully connected layers, are more semantically meaningful but therefore lose their ability to discriminate between individual places within the same semantic type of scene. For example, in the St. Lucia footage the higher layers would encode the information 'suburban road scene' equally for all images in the dataset. This enables place categorization but is disadvantageous for place recognition.
The mid-level layers – and especially conv3 – seem to encode just the right amount of information; they are more informative and more robust to changes than pure low-level pixel or gradient based features, while remaining discriminative enough to identify individual places.

TABLE IV: The mid-level conv3 outperforms all other layers in the presence of significant appearance changes.

    Dataset                       Appearance  Viewpoint  Best Layer
    Nordland seasons              severe      none       conv3
    GP day-right vs night-right   severe      none       conv3
    GP day-left vs night-right    severe      medium     conv3
    St. Lucia                     medium      medium     conv3
    Campus                        medium      medium     conv3
    GP day-left vs day-right      minor       medium     fc6
    Nordland synthetic            none        varied     fc6

Extracting a ConvNet feature from an image requires approximately 15 ms on a Nvidia Quadro K4000 GPU. The bottleneck of the place recognition system is the nearest neighbor search that is based on the cosine distance between 64,896 dimensional feature vectors – a computationally expensive operation. Our Numpy/Scipy implementation requires 3.5 seconds to find a match among 10,000 previously visited places. As we shall see in the next section, two important algorithmic improvements can be introduced that lead to a speed-up of two orders of magnitude.

V. REAL-TIME LARGE-SCALE PLACE RECOGNITION

In contrast to typical computer vision benchmarks where the recognition accuracy is the most important performance metric, robotics applications depend on agile algorithms that can provide a solution under certain soft real-time constraints. The nearest neighbor search is the key limiting factor for large-scale place recognition, as its runtime is proportional to the number of stored previously visited places. In the following we explore two approaches that decrease the required search time by two orders of magnitude with only minimal accuracy degradation.

A. Locality Sensitive Hashing for Runtime Improvements

Computing the cosine distance between many 64,896 dimensional conv3 feature vectors is an expensive operation and is a bottleneck of the ConvNet-based place recognition. To speed this process up significantly, we propose to use a specialized variant of binary locality-sensitive hashing that preserves the cosine similarity [35]. This hashing method leverages the property that the probability of a random hyperplane separating two vectors is directly proportional to the angle between these vectors. [36] demonstrates how the cosine distance between two high-dimensional vectors can be closely approximated by the Hamming distance between the respective hashed bit vectors. The more bits the hash contains, the better the approximation. We implemented this method and compare the place recognition performance achieved with the hashed conv3 feature vectors of different lengths (2^8 ... 2^13 bits) on the Nordland and Gardens Point datasets in Fig. 9. Using 8192 bits retains approximately 95% of the place recognition performance. Hashing the original 64,896 dimensional vectors into 8192 bits corresponds to a data compression of 99.6% (8192 bits versus 64,896 32-bit floats). Since the Hamming distance over bit vectors is a computationally cheap operation, the best matching image among 10,000 candidates can be found within 13.4 ms on a standard desktop machine. This corresponds to a speed-up factor of 266 compared to using the cosine distance over the original conv3 features, which required 3570 ms per 10,000 candidates. Calculating the hashes requires 180 ms using a non-optimized Python implementation. Table V summarizes the required time for the main algorithmic steps. We can see that the hashing enables real-time place recognition using ConvNet features on large scale maps of 100,000 places or more.
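A minimal NumPy sketch of this random-hyperplane hashing [35] follows (an illustration, not the paper's implementation; class and parameter names are ours). Each bit records which side of a random hyperplane a vector falls on, so the Hamming distance between two codes approximates the angle, and hence the cosine distance, between the original vectors:

```python
import numpy as np

class CosineLSH:
    """Random-hyperplane LSH: Hamming distance between codes approximates
    the cosine distance between the original high-dimensional vectors."""

    def __init__(self, dim, n_bits=8192, seed=0):
        # One random hyperplane per output bit. For dim=64896 this projection
        # matrix is large; a production system would optimize its storage.
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim), dtype=np.float32)

    def hash(self, x):
        bits = self.planes @ x > 0           # side of each hyperplane
        return np.packbits(bits)             # e.g. 8192 bits -> 1024 bytes

    @staticmethod
    def hamming(code, codes):
        # XOR plus popcount against the packed codes of all stored places.
        return np.unpackbits(np.bitwise_xor(codes, code), axis=1).sum(axis=1)
```

Matching then reduces to `np.argmin(CosineLSH.hamming(query_code, stored_codes))`, which is the cheap operation behind the timings in Table V.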
i To speed this process up significantly, we propose to use a The nearest neighbor search can then be limited to places specialized variant of binary locality-sensitive hashing that that belong to the same semantic category as the currently preserves the cosine similarity [35]. This hashing method observed place. We tested this approach on the Campus leverages the property that the probability of a random dataset and trained the classifier to discriminate between 11 hyperplane separating two vectors is directly proportional to semantic classes. Partitioning the search space decreased the time spent for the nearest neighbor search by a factor of 4. Fig. 10 shows the place recognition performance decreasing Variationsin Dataset Appearance Viewpoint BestLayer Nordlandseasons severe none conv3 GPday-rightvsnight-right severe none conv3 originalconv3 4096bits 8192bits GPday-leftvsnight-right severe medium conv3 ConvNetfeature 15ms 15ms 15ms St.Lucia medium medium conv3 hashing — 100ms 180ms Campus medium medium conv3 match100kcandidates 35,700ms 89ms 134ms GPday-leftvsday-right minor medium fc6 framerate 0.03Hz 4.9Hz 3.0Hz Nordlandsynthetic none varied fc6 TABLEV:Runtimecomparisonbetweenimportantalgorith- TABLE IV: The mid-level conv3 outperforms all other mic steps for the original features and two different hashes layers in the presence of significant appearance changes. on a desktop machine with a Nvidia Quadro K4000 GPU. Nordland: spring vs. winter Nordland: spring vs. winter 1.0 1.0 0.8 0.8 n n Precisio00..46 148001299462 bbbiiitttsss Precisio00..46 APllaecxeNse2t0 (54 0(49069 b6i tbsi)ts) 0.2 full feature 0.2 Hybrid (4096 bits) 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Recall Gardens Point: day-left vs. day-right 1.0 256 bits Fig. 11: The Places205 and the Hybrid networks 0.8 512 bits from [17] perform slightly better at place recognition than n Precisio00..46 148001299462 bbbiiitttsss AhalsehxedNectonovn3thfeeaNtuorerdslawnedreSupsreindgfovrs.thWesienteexrpdeartiamseetn.tsT.he 0.2 full feature 0.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall network perform slightly better under severe appearance changes. This could be explained by the fact that AlexNet Fig. 9: The cosine distance over the full feature vector istrainedforobjectrecognitionwhilethetwoothernetworks of 64896 floating point elements (red), can be closely ap- are specialized for recognizing scene categories, i.e. they proximated by the Hamming distance over bit vectors of length 213 (blue) without losing much performance. This learned to discriminate places. However, in the presence of viewpoint changes (Gardens Point left vs. right), the results corresponds to a compression of 99.6%. The bit vectors are inconclusive and AlexNet has a slight performance arecalculatedusingthecosinesimilaritypreservingLocality advantage. Sensitive Hashing method as proposed in [36]. VII. CONCLUSIONS Campus: human vs. robot with search space partitioning 1.0 Ourpaperpresentedathoroughinvestigationontheutility 0.8 of ConvNet features for the important task of visual place n Precisio00..46 rbeicnoegtnhietioinndiinvirdoubaoltisctsre.nWgethpsreosfetnhteedhaignho-lveevleml aenthdodmtiod-cleovme-l 0.2 conv3 conv3 with partition feature layers to partition the search space and recognize 0.00.0 0.2 0.4 0.6 0.8 1.0 places under severe appearance changes. We demonstrated Recall for the first time that large-scale robust place recognition using ConvNet features is possible when applying a special- Fig. 
Fig. 10: Partitioning the search space decreases the performance slightly but can significantly reduce the time spent for the nearest neighbor search (76% in this case).

VI. COMPARING DIFFERENT CONVNETS FOR PLACE RECOGNITION

All experiments described so far used the AlexNet ConvNet architecture [26] as implemented by Caffe [29] that was pre-trained for the task of object recognition on the ILSVRC dataset [24]. While preparing this paper, [17] published the Places205 and Hybrid networks. These two networks have the same principled architecture as AlexNet but have been trained for scene categorization (Places205) or both tasks (Hybrid). We compare their place recognition performance with that of AlexNet, using the hashed conv3 features. Table VI and Fig. 11 summarize the results. Compared to AlexNet, the Places205 and Hybrid networks perform slightly better under severe appearance changes. This could be explained by the fact that AlexNet is trained for object recognition while the two other networks are specialized for recognizing scene categories, i.e. they learned to discriminate places. However, in the presence of viewpoint changes (Gardens Point left vs. right), the results are inconclusive and AlexNet has a slight performance advantage.

Fig. 11: The Places205 and the Hybrid networks from [17] perform slightly better at place recognition than AlexNet on the Nordland Spring vs. Winter dataset. The hashed conv3 features (4096 bits) were used for the experiments.

TABLE VI: The three convolutional networks we compare in this paper. All of them are implemented using Caffe [29]. Notice that none of them is specifically trained for the task of place recognition. Place recognition performance is given as the F1 score using conv3.

                                                                  Nordland        Gardens Point
    Network    Pre-trained for Task   Pre-trained on Dataset   spring–winter  day L–day R  day R–night R  day L–night R
    AlexNet    object recognition     ImageNet ILSVRC [24]         0.68          0.85          0.85           0.48
    Places205  place categorization   Places Database [17]         0.71          0.82          0.84           0.46
    Hybrid     both of above tasks    Places + ILSVRC [17]         0.71          0.84          0.87           0.44

VII. CONCLUSIONS

Our paper presented a thorough investigation on the utility of ConvNet features for the important task of visual place recognition in robotics. We presented a novel method to combine the individual strengths of the high-level and mid-level feature layers to partition the search space and recognize places under severe appearance changes. We demonstrated for the first time that large-scale robust place recognition using ConvNet features is possible when applying a specialized binary hashing method. Our comprehensive study on four real world datasets highlighted the individual strengths of mid- and high-level features with respect to the biggest challenges in visual place recognition – appearance and viewpoint changes. A comparison of three state-of-the-art ConvNets revealed slight performance advantages for the networks trained for semantic place categorization. In subsequent work [33] we applied the insights gained in this paper and extended the holistic approach presented here to a landmark-based scheme that addresses the remaining challenge of combined viewpoint and appearance change robustness. In future work we will investigate how training ConvNets specifically for the task of place recognition under changing conditions can improve their performance.

ACKNOWLEDGEMENTS

This research was conducted by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016). We want to thank Arren Glover and Will Maddern for collecting the Gardens Point and St. Lucia datasets.

REFERENCES

[1] M. Milford and G. F. Wyeth, "SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights," in Proc. of Intl. Conf. on Robotics and Automation (ICRA), 2012.
[2] P. Corke, R. Paul, W. Churchill, and P. Newman, "Dealing with shadows: Capturing intrinsic scene appearance for image-based outdoor localisation," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013.
[3] P. Neubert, N. Sünderhauf, and P. Protzel, "Superpixel-based appearance change prediction for long-term navigation across seasons," Journal of Robotics and Autonomous Systems, 2014.
[4] E. Johns and G.-Z. Yang, "Feature co-occurrence maps: Appearance-based localisation throughout the day," in Proceedings of International Conference on Robotics and Automation (ICRA), 2013.
[5] N. Sünderhauf, P. Neubert, and P. Protzel, "Are We There Yet? Challenging SeqSLAM on a 3000 km Journey Across All Four Seasons," in Proceedings of Workshop on Long-Term Autonomy, IEEE International Conference on Robotics and Automation (ICRA), 2013.
[6] E. Pepperell, P. I. Corke, and M. J. Milford, "All-environment visual place recognition with SMART," in IEEE International Conference on Robotics and Automation (ICRA), 2014.
[7] T. Naseer, L. Spinello, W. Burgard, and C. Stachniss, "Robust visual robot localization across seasons using network flows," in Proceedings of the National Conference on Artificial Intelligence (AAAI), 2014.
[8] B. Upcroft, C. McManus, W. Churchill, W. Maddern, and P. Newman, "Lighting invariant urban street classification," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2014.
[9] W. Churchill and P. M. Newman, "Practice makes perfect? Managing and leveraging visual experiences for lifelong navigation," in Proc. of Intl. Conf. on Robotics and Automation (ICRA), 2012.
[10] C. McManus, W. Churchill, W. Maddern, A. Stewart, and P. Newman, "Shady dealings: Robust, long-term visual localisation using illumination invariance," in Proc. of IEEE Intl. Conf. on Robotics and Automation (ICRA), 2014.
[11] W. Maddern, A. Stewart, C. McManus, B. Upcroft, W. Churchill, and P. Newman, "Illumination invariant imaging: Applications in robust vision-based localisation, mapping and classification for autonomous vehicles," in Proceedings of the Visual Place Recognition in Changing Environments Workshop, IEEE Intl. Conf. on Robotics and Automation (ICRA), 2014.
[12] S. Lowry, M. Milford, and G. Wyeth, "Transforming morning to afternoon using linear regression techniques," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2014.
[13] C. Stachniss and W. Burgard, "Particle filters for robot navigation," Foundations and Trends in Robotics, vol. 3, no. 4, pp. 211–282, 2014.
[14] S. Lowry, G. Wyeth, and M. Milford, "Towards training-free appearance-based localization: probabilistic models for whole-image descriptors," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2014.
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[16] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2014.
[17] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using Places database," in NIPS, 2014.
[18] M. Cummins and P. Newman, "FAB-MAP: Probabilistic localization and mapping in the space of appearance," The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
[19] ——, "Appearance-only SLAM at large scale with FAB-MAP 2.0," The International Journal of Robotics Research, vol. 30, no. 9, 2011.
[20] G. Sibley, L. Matthies, and G. Sukhatme, "Sliding window filter with application to planetary landing," J. Field Robotics, vol. 27, no. 5, 2010.
[21] P. Neubert, N. Sünderhauf, and P. Protzel, "Appearance change prediction for long-term navigation across seasons," in Proceedings of European Conference on Mobile Robotics (ECMR), 2013.
[22] W. Maddern, M. Milford, and G. Wyeth, "Continuous appearance-based trajectory SLAM," in International Conference on Robotics and Automation (ICRA), 2011.
[23] C. McManus, B. Upcroft, and P. Newman, "Scene signatures: Localised and point-less features for localisation," in Proceedings of Robotics Science and Systems (RSS), Berkeley, CA, USA, July 2014.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, 2014.
[25] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, 1989.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, 2012.
[27] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," arXiv preprint arXiv:1312.6229, 2013.
[28] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," arXiv preprint arXiv:1310.1531, 2013.
[29] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. of ACM International Conference on Multimedia, 2014.
[30] Z. Chen, O. Lam, A. Jacobson, and M. Milford, "Convolutional neural network-based place recognition," in Proceedings of Australasian Conference on Robotics and Automation (ACRA), 2014.
[31] A. Glover, W. Maddern, M. Milford, and G. Wyeth, "FAB-MAP + RatSLAM: Appearance-based SLAM for multiple times of day," in Proc. of IEEE Intl. Conf. on Robotics and Automation (ICRA), 2010.
[32] A. J. Glover, W. P. Maddern, M. J. Milford, and G. F. Wyeth, "FAB-MAP + RatSLAM: appearance-based SLAM for multiple times of day," in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 3507–3512.
[33] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, "Place recognition with ConvNet landmarks: Viewpoint-robust, condition-robust, training-free," in Proc. of Robotics: Science and Systems (RSS), 2015.
[34] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Computer Vision – ECCV 2014, 2014.
[35] M. S. Charikar, "Similarity estimation techniques from rounding algorithms," in Proc. of Annual ACM Symposium on Theory of Computing, 2002.
[36] D. Ravichandran, P. Pantel, and E. Hovy, "Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering," in Proc. of the Annual Meeting on Association for Computational Linguistics, 2005.
[37] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "SUN database: Exploring a large collection of scene categories," International Journal of Computer Vision, pp. 1–20, 2014.