VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem

Ronald Clark¹, Sen Wang¹, Hongkai Wen², Andrew Markham¹ and Niki Trigoni¹
¹Department of Computer Science, University of Oxford, United Kingdom
²Department of Computer Science, University of Warwick, United Kingdom
Email: {firstname.lastname}@cs.ox.ac.uk

Abstract

In this paper we present an on-manifold sequence-to-sequence learning approach to motion estimation using visual and inertial sensors. It is, to the best of our knowledge, the first end-to-end trainable method for visual-inertial odometry which performs fusion of the data at an intermediate feature-representation level. Our method has numerous advantages over traditional approaches. Specifically, it eliminates the need for tedious manual synchronization of the camera and IMU, as well as the need for manual calibration between the IMU and camera. A further advantage is that our model naturally and elegantly incorporates domain-specific information which significantly mitigates drift. We show that our approach is competitive with state-of-the-art traditional methods when accurate calibration data is available, and can be trained to outperform them in the presence of calibration and synchronization errors.

Figure 1: Comparison between a standard visual-inertial odometry framework and our learning-based approach. Elements in blue need to be specified during setup. The parameters of VINet are hidden from the user and fully learned from data.

Introduction

A fundamental requirement for mobile robot autonomy is the ability to accurately navigate where no GPS signals are available. One of the most promising approaches to achieving this goal is through the fusion of images from a monocular camera and an inertial measurement unit. This setup has the advantages of being both cheap and ubiquitous and, through the complementary nature of the sensors, has the potential to provide pose estimates which are on par in terms of accuracy with more expensive stereo and LiDAR setups. As such, monocular visual-inertial odometry (VIO) approaches have received considerable attention in the robotics community (Gui et al. 2015), and current state-of-the-art approaches to VIO (Leutenegger et al. 2015) are able to achieve impressive accuracy. However, these approaches still suffer from strict calibration and synchronization requirements.

Inspired by the recent success of deep-learning models for processing raw, high-dimensional data, we propose in this paper a fresh approach to VIO by regarding it as a sequence-to-sequence regression problem. The resulting approach, VINet, is a fully trainable end-to-end model for performing visual-inertial odometry. Our contributions are as follows:

• We present the first system for visual-inertial aided navigation that is fully end-to-end trainable.
• We introduce a novel recurrent network architecture and training procedure to optimally train the parameters of the model.
• This includes a novel differentiable pose concatenation layer which allows the network's predictions to conform to the structure of the SE(3) manifold.
• We evaluate our method and demonstrate its advantages over traditional methods on real-world data.
Related Work

In this section we briefly outline works strictly focused on monocular cameras and inertial measurement unit data, in the absence of other sensors such as stereo setups and laser rangefinders.

Visual Odometry: VO algorithms estimate the incremental ego-motion of a camera. A traditional VO algorithm, illustrated in Fig. 1a, operates by extracting features in an image, matching the features between the current and successive images and then computing the optical flow. The motion can then be computed using the optical flow. The fast semi-direct monocular visual odometry (SVO) algorithm (Forster, Pizzoli, and Scaramuzza 2014) is an example of a state-of-the-art VO algorithm. It is designed to be fast and robust by operating directly on image patches rather than relying on slow feature extraction. Instead, it uses probabilistic depth filters on patches of the image itself; the depth filters are then updated through whole-image alignment. This visual odometry algorithm is efficient and runs in real time on an embedded platform. Its probabilistic formulation, however, makes it difficult to tune, and it also requires a bootstrapping procedure to start the process. As expected, its performance depends heavily on the hardware to prevent tracking failures: typically a global-shutter camera operating at higher than 50 fps needs to be used to ensure the odometry estimates remain accurate.

Visual-Inertial Odometry: Regardless of the algorithm, traditional monocular VO solutions are unable to observe the scale of the scene and are subject to scale drift and scale ambiguity. Dedicated loop-closure methods (e.g. FAB-MAP (Cummins and Newman 2008)) are integrated to reduce scale drift. Scale ambiguity, however, cannot be resolved using loop closure and requires the integration of external information. This usually takes the form of detecting the scale of objects in the scene (Castle, Klein, and Murray 2010; Pillai and Leonard 2015; Salas-Moreno et al. 2013) or fusing information from an inertial measurement unit (IMU) to create a visual-inertial odometry (VIO) setup. Fusing inertial and visual information not only resolves the scale ambiguity but also increases the accuracy of the VO itself. In theory, the complementary nature of inertial measurements from an IMU and visual data should enable highly accurate ego-motion estimation under any circumstances: visual localization techniques are entirely reliant on observing distinctive features in the environment, and in indoor situations these techniques are plagued by intermittent occlusions. On the other hand, the pose of an IMU device can be tracked through double integration of the acceleration data, or through more complex schemes such as the on-manifold pre-integration strategy of (Forster et al. 2015), and then integrated with a visual-odometry method such as SVO to form a VIO system in which the IMU drift remains constrained. In state-of-the-art systems, fusion is achieved either through a filter-based or an optimization-based procedure. Filter-based methods such as the Multi-State Constraint Kalman Filter (MSCKF) (Mourikis and Roumeliotis 2007), although more robust, consistently under-perform their optimization-based counterparts, such as the Sliding Window Filter (SWF) (Sibley, Matthies, and Sukhatme 2008) and Open Keyframe Visual-Inertial odometry (OK-VIS) (Leutenegger et al. 2015), in terms of accuracy. In (Forster et al. 2015) it is shown that a system using the pre-integration strategy slightly outperforms OK-VIS in terms of accuracy. However, that system relies on iSAM2 (Kaess et al. 2011) as the back-end optimization and SVO as the front-end tracking system, which we found fails more often than OK-VIS. We therefore compare against OK-VIS in this paper, and against MSCKF for robustness.

Deep learning: Some deep-learning approaches have been proposed for visual odometry; however, to the best of our knowledge, a neural network approach has never been used in any form for end-to-end monocular visual-inertial odometry. In (Konda and Memisevic 2015), a stereo-VO method is presented in which motion is extracted by detecting "synchronicity" across the stereo frames. (Costante et al. 2016) investigated the feasibility of using a CNN to extract ego-motion from optical flow frames. In (DeTone, Malisiewicz, and Rabinovich 2016) the feasibility of using a CNN for extracting the homography relationship between frame pairs was shown. Recently, (Rambach et al. 2016) used an RNN to fuse the output of a standard marker-based visual pose system and an IMU, thereby eliminating the need for statistical filtering.
Background: Recurrent and Convolutional Networks

Recurrent Neural Networks (RNNs) refer to a general type of neural network in which the layers operate not only on the input data but also on delayed versions of the hidden layers and/or output. In this manner, the network has an internal state which it can use as a "memory" to keep track of past inputs and the corresponding decisions. RNNs, however, have the disadvantage that, using standard training techniques, they are unable to learn to store and operate on long-term trends in the input and thus do not provide much benefit over standard feed-forward networks. For this reason, the Long Short-Term Memory (LSTM) architecture was introduced to allow RNNs to learn longer-term trends (Hochreiter and Schmidhuber 1997). This is accomplished through the inclusion of gating cells which allow the network to selectively store and "forget" memories.

There are numerous variations of the LSTM architecture; however, these have been shown to have similar performance on real-world data (Zaremba 2015). The contents of the memory cell are stored in c_t. The input gate i_t controls how the input enters into the contents of the memory cell for the current time-step. The forget gate f_t determines when the memory cell should be emptied by producing a control signal in the range 0 to 1 which clears the memory cell as needed. Finally, the output gate o_t determines whether the contents of the memory cell should be used at the current time-step:

    i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (1)
    f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (2)
    z_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (3)
    c_t = f_t \odot c_{t-1} + i_t \odot z_t    (4)
    o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)    (5)
    h_t = o_t \odot \tanh(c_t)    (6)

where the weights W and biases b fully parameterise the operation of the network and are learned during training.
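To make Equations 1-6 concrete, the following is a minimal NumPy sketch of a single LSTM step, including the peephole terms on the gates as written above. The function name, the dictionary layout of the parameters, and the use of dense peephole matrices are our own illustrative choices, not details from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update (Eqs. 1-6). W holds weight matrices keyed by
    'xi', 'hi', 'ci', ..., and b holds bias vectors keyed by 'i', 'f', 'c', 'o'."""
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] @ c_prev + b['i'])  # input gate   (1)
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] @ c_prev + b['f'])  # forget gate  (2)
    z_t = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])                     # candidate    (3)
    c_t = f_t * c_prev + i_t * z_t                                               # memory cell  (4)
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] @ c_t + b['o'])     # output gate  (5)
    h_t = o_t * np.tanh(c_t)                                                     # hidden state (6)
    return h_t, c_t
```

In the full model these updates are simply unrolled over the input sequence; the BPTT procedure described in the Training section differentiates through this unrolled computation.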
In order to process high-dimensional input data (such as images), convolutional layers can be integrated into the RNN structure. At each convolutional layer, multiple convolution operations are applied to extract a number of features from the output map of the previous layer. The filter kernels with which the maps are convolved are learned during training. Specifically, the output of a convolutional layer is computed as

    y_{l,f}^{x,y} = \mathrm{act}\left( b_{l,f} + \sum_{m} \sum_{p=0}^{P_l-1} \sum_{q=0}^{Q_l-1} w_{l,f,m}^{p,q} \, y_{(l-1)m}^{x+p,\,y+q} \right)    (7)

where y_{l,f}^{x,y} is the value of the output of feature map f of the l-th layer at location (x, y), b_{l,f} is a constant bias term also learned during training, w_{l,f,m}^{p,q} is the kernel value at location (p, q), and P_l, Q_l are the width and height of the kernel.

Our Approach

Our sequence-to-sequence learning approach to visual-inertial odometry, VINet, is shown in Fig. 2. The model consists of a CNN-RNN network which has been tailored to the task of visual-inertial odometry estimation. The entire network is differentiable and thus trainable end-to-end for the purpose of egomotion estimation. The input to the network is monocular RGB images and IMU data, the latter a 6-dimensional vector containing the x, y, z components of acceleration and of angular velocity measured using a gyroscope. The output of the network is a 7-dimensional vector - a 3-dimensional translation and a 4-dimensional orientation quaternion - representing the change in pose of the robot from the start of the sequence. In essence, our network learns the following mapping, which transforms input sequences of images and IMU data to poses:

    \mathrm{VIO}: \left\{ \left( \mathbb{R}^{W \times H}, \mathbb{R}^{6} \right) \right\}_{1:N} \rightarrow \left\{ \left( \mathbb{R}^{7} \right) \right\}_{1:N}    (8)

where W × H is the width and height of the input images and 1 : N are the timesteps of the sequence. We now describe the detailed structure of our VINet model, which integrates the following components.

Figure 2: The proposed VINet architecture for visual-inertial odometry. The network consists of a core LSTM processing the pose output at camera rate and an IMU LSTM processing data at the IMU rate.

SE(3) Concatenation of Transformations

The pose of a camera relative to an initial starting point is conventionally represented as an element of the special Euclidean group SE(3) of transformations. SE(3) is a differentiable manifold with elements consisting of a rotation from the special orthogonal group SO(3) and a translation vector,

    \mathbf{T} = \left\{ \begin{pmatrix} R & T \\ 0 & 1 \end{pmatrix} \;\middle|\; R \in SO(3),\; T \in \mathbb{R}^3 \right\}.    (9)

Producing transformation estimates belonging to SE(3) is not straightforward, as the SO(3) component needs to be an orthogonal matrix. However, the Lie algebra se(3) of SE(3), representing the instantaneous transformation,

    \xi = \left\{ \begin{pmatrix} [\omega]_{\times} & v \\ 0 & 0 \end{pmatrix} \;\middle|\; \omega \in so(3),\; v \in \mathbb{R}^3 \right\},    (10)

can be described by components which are not subject to orthogonality constraints. Conversion between se(3) and SE(3) is then easily accomplished using the exponential map

    \exp : se(3) \rightarrow SE(3).    (11)

In our network, a CNN-RNN processes the monocular sequence of images to produce an estimate of the frame-to-frame motion undergone by the camera. The CNN-RNN thus performs the mapping from the input data to the Lie algebra se(3). The exponential map is used to convert these estimates to the special Euclidean group SE(3), where the individual motions can then be composed to form a trajectory. In this manner, the function that the network needs to approximate remains bounded over time, as the frame-to-frame motion undergone by the camera is defined by the, albeit complex, dynamics of the platform over the course of the trajectory. With the RNN we aim to learn these complex motion dynamics of the platform and account for sequential dependencies which are very difficult to model by hand.

Figure 3: Illustration of the SE(3) composition layer - a parameter-free layer which concatenates transformations between frames on SE(3).

Furthermore, in the traditional LSTM model the hidden state is carried over to the next time-step, but the output itself is not fed back to the input. In the case of odometry estimation the availability of the previous state is particularly important, as the output is essentially an accumulation of incremental displacements at each step. Thus, for our model, we directly connect the output pose produced by the SE(3) concatenation layer back as input to the Core LSTM for the next timestep.
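The SE(3) composition layer contains no learnable parameters: it applies the exponential map of Eq. 11 to each frame-to-frame prediction and chains the resulting transforms. Below is a rough NumPy sketch of this idea under our own naming (`se3_exp`, `compose_trajectory`) and using the standard closed-form Rodrigues expressions; it illustrates Eqs. 9-11 and is not the authors' implementation.

```python
import numpy as np

def skew(w):
    """[w]_x : 3-vector -> 3x3 skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(omega, v):
    """Exponential map se(3) -> SE(3) for one frame-to-frame motion (Eq. 11)."""
    theta = np.linalg.norm(omega)
    W = skew(omega)
    if theta < 1e-8:                               # small-angle limit
        R, V = np.eye(3) + W, np.eye(3)
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * W @ W          # Rodrigues formula for SO(3)
        V = np.eye(3) + B * W + C * W @ W          # left Jacobian applied to translation
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def compose_trajectory(frame_to_frame):
    """Chain per-frame (omega, v) predictions into absolute SE(3) poses."""
    pose, trajectory = np.eye(4), []
    for omega, v in frame_to_frame:
        pose = pose @ se3_exp(omega, v)            # SE(3) composition layer
        trajectory.append(pose.copy())
    return trajectory
```

In the network this composition is implemented as a differentiable layer, so gradients from a loss on the concatenated pose can flow back into the frame-to-frame predictions.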
Multi-rate LSTM

In the problem of visual-inertial odometry we are faced with the challenge that the data streams are multi-rate, i.e. the IMU data often arrives an order of magnitude (typically 10×) faster (100 Hz) than the visual data (10 Hz). To accommodate this in our proposed network, we process the IMU data using a small LSTM running at the IMU rate. The final hidden-layer activation of the IMU-LSTM is then carried over to the Core-LSTM.

Optical Flow Weight Initialization

The CNN takes two sequential images as input and, similar to the IMU LSTM, produces a single feature vector describing the motion that the device underwent during the passing of the two frames, which is used as input to the Core LSTM. We initially experimented with two frames fed directly into a CNN pre-trained on the ImageNet dataset; however, this showed incredibly slow training convergence and disappointing test performance. We therefore used as our base a network trained to predict optical flow from RGB images (Fischer et al. 2015). Our CNN mimics the structure of FlowNet up to the Conv6 layer (Fischer et al. 2015), where we remove the layers which produce the high-resolution optical flow output and feed in only a 1024×6×20 vector, which we flatten and concatenate with the feature vector produced by the IMU-LSTM before it is fed to the Core LSTM. The Core-LSTM fuses the intermediate feature-level representations of the visual and inertial data to produce a pose estimate.
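Pulling these pieces together, the sketch below shows one camera-rate update of the multi-rate pipeline: the IMU-LSTM consumes all 6-D inertial samples that arrived between two frames, and its final activation is concatenated with the flattened visual feature (and the previous pose estimate, which is fed back as described earlier) before entering the Core LSTM. The names and the exact concatenation layout are our own assumptions for illustration, not the authors' code.

```python
import numpy as np

def camera_rate_update(visual_feat, imu_samples, imu_step, core_step, state):
    """One update of the multi-rate fusion.

    visual_feat: flattened FlowNet-style feature for the frame pair (I_t, I_{t+1})
    imu_samples: list of 6-D vectors (accel x,y,z and gyro x,y,z) at the IMU rate
    imu_step, core_step: callables (x, h, c) -> (h, c), e.g. lstm_step above
                         with its weights bound via functools.partial
    state: dict holding the recurrent states and the previous pose estimate
    """
    h_imu, c_imu = state['h_imu'], state['c_imu']
    for u in imu_samples:                      # ~10x more IMU samples than frames
        h_imu, c_imu = imu_step(u, h_imu, c_imu)

    # Feature-level fusion: visual feature + IMU-LSTM activation + previous pose
    fused = np.concatenate([visual_feat, h_imu, state['prev_pose']])
    h_core, c_core = core_step(fused, state['h_core'], state['c_core'])

    state.update(h_imu=h_imu, c_imu=c_imu, h_core=h_core, c_core=c_core)
    return h_core, state   # h_core is then mapped to an se(3) motion estimate
```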
Computational requirements

The computational requirements for odometry prediction, and the storage space required for the model, are directly affected by the number of parameters used to define the model. For our network in Fig. 2, the parameters are the weight matrices of the LSTMs, for both the IMU LSTM and the Core LSTM, as well as those of the CNN which processes the images. For our network we use LSTMs with 2 layers with cells of 1000 units. Our CNN has a total of 55M trainable weights. A forward pass of images through the CNN part of the network takes on average 160 ms (≈ 10 Hz) on a single Tesla K80. The LSTM updates are much less computationally expensive and can run at > 200 Hz on the Tesla K80.

Training

The entire network is trained using Backpropagation Through Time (BPTT). We use standard BPTT, which works by unfolding the network for a selected number of timesteps, T, and then applying the standard backpropagation learning method involving two passes: a forward pass and a backward pass. In the forward pass of BPTT, the activations of the network from Equations 1 to 6 are calculated successively for each timestep from t = 1 to T. Using the resulting activations, the backward pass proceeds from t = T to t = 1, calculating the derivatives of each output unit with respect to the layer input (x^l) and the weights of the layer (w^l). The final derivatives are then determined by summing over the time-steps. Stochastic Gradient Descent (SGD) with an RMSProp adaptive learning rate is used to update the weights of the network using the gradients determined by BPTT. SGD is a simple and popular method that performs very well for training a variety of machine learning models using large datasets (Bottou and Bousquet 2008). Using SGD, the weights of the network are updated as follows:

    w^l = w^l - \lambda \frac{\partial L(w^l, x_t)}{\partial w^l}    (12)

where w^l represents a parameter (weight or bias) of the network indexed by l, and the learning rate \lambda determines how strongly the derivatives influence the weight updates during each iteration of SGD. For all our training we select the best learning rate.

Training long, continuous sequences with high-dimensional images as input requires an excessive amount of memory. To reduce the memory required, but still keep continuity during training, we use a training structure in which training is carried out over a sliding window of batches, with the hidden state of the LSTM carried over between windows, as illustrated in Fig. 4.

Figure 4: Batch structure used for training on long odometry sequences.

Finally, we found that training the network directly through the SE(3) accumulation is particularly difficult, as the training procedure suffers from many local minima. In order to overcome this difficulty, we consider two losses, one based on the se(3) frame-to-frame (F-2-F) predictions and the other on the full SE(3) pose concatenated relative to the start of the sequence. The loss computed from the F-2-F pose is

    L_{se(3)} = \alpha \sum \lVert \omega - \hat{\omega} \rVert + \beta \lVert v - \hat{v} \rVert    (13)

For the full concatenated pose in SE(3), we use a quaternionic representation for the orientation, giving the loss

    L_{SE(3)} = \alpha \sum \lVert q - \hat{q} \rVert + \beta \lVert T - \hat{T} \rVert    (14)

We consider three types of training: training only on the L_{se(3)} loss, only on the L_{SE(3)} loss, and joint training on both losses. The weight updates for the joint training are shown in Algorithm 1.

Algorithm 1: Joint training of the se(3) and SE(3) losses
    while i ≤ n_iter do
        w^{1:n} = w^{1:n} - \lambda_1 \frac{\partial L_{SE(3)}(w^l, x_t)}{\partial w^l}
        w^{1:j} = w^{1:j} - \lambda_2 \frac{\partial L_{se(3)}(w^l, x_t)}{\partial w^l}
    end while

Here n_iter is the number of training iterations, w^{1:j} are the trainable weights of the layers up to the SE(3) concatenation layer, and w^{1:n} are the weights of all the layers in a network with n layers. During training we start with a high relative learning rate for the se(3) loss, λ_2/λ_1 ≈ 100, and then reduce this to a very low value, λ_2/λ_1 ≈ 0.1, during the later epochs to fine-tune the concatenated pose estimation.
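The two-level update in Algorithm 1 can be sketched as follows. This is our own illustrative code: plain SGD stands in for the RMSProp-based optimiser actually used, and `grad_SE3`/`grad_se3` are hypothetical helpers returning the gradients of the two losses. The point is only that the full-pose loss updates all n layers while the frame-to-frame loss updates only the layers up to the SE(3) concatenation layer.

```python
def joint_training_step(weights, batch, grad_SE3, grad_se3, lam1, lam2, j):
    """One iteration of Algorithm 1.

    weights:  list of parameter arrays ordered from input to output;
              weights[:j] are the layers up to the SE(3) concatenation layer.
    grad_SE3: callable -> per-parameter gradient of L_SE(3) (full concatenated pose)
    grad_se3: callable -> per-parameter gradient of L_se(3) (frame-to-frame pose)
    lam1, lam2: learning rates; lam2/lam1 starts around 100 and decays to ~0.1
    """
    for w, g in zip(weights, grad_SE3(weights, batch)):       # update all n layers
        w -= lam1 * g
    for w, g in zip(weights[:j], grad_se3(weights, batch)):   # update layers below SE(3) layer
        w -= lam2 * g
    return weights
```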
Results

In this section we present results evaluating the proposed method in terms of accuracy and robustness to calibration and synchronization errors, and provide comparisons to traditional methods. For our experiments, we implemented our model using the Theano library (Bergstra et al. 2010) and carried out all our training on a Tesla K80. We trained the model for each dataset for 200 epochs, which took on average 6 hours per dataset. The training process did not require any user intervention, apart from setting an appropriate learning rate.

UAV: Challenging Indoor Trajectory

We first evaluate our approach on the publicly available indoor EuRoC micro-aerial-vehicle (MAV) dataset (Burri et al. 2016). The data for this dataset was captured using an AscTec Firefly MAV with a front-facing visual-inertial sensor unit with tight synchronization between the camera and IMU timestamps. The images were captured by a global-shutter camera at a rate of 20 Hz, and the acceleration and angular rate measurements from the IMU at 200 Hz. The 6-D ground-truth pose was captured using a Vicon motion capture system at 100 Hz. In order to provide an objective comparison to the closest related method in the literature, the optimization-based OK-VIS (Leutenegger et al. 2015) method is used for comparison. As we are interested in evaluating the odometric performance of the methods, no loop closures are performed.

We test the robustness of our method against camera-sensor calibration errors. We introduce calibration errors by adding a rotation of a chosen magnitude and random angle, ΔR_{SC} ∼ vMF(·|μ, κ), to the camera-IMU rotation matrices R_{SC}. For our VINet we present two sets of results: one where we have augmented the training set with artificially mis-calibrated training data, and one where only the calibrated data has been used to train the network.

Figure 5: 6D MAV reconstructed trajectory using the proposed neural network compared to OK-VIS (Leutenegger et al. 2015).

Fig. 5 shows the comparison of the MAV trajectory estimated by OK-VIS and by VINet for various levels of mis-calibration. It is evident that, even when trained using no augmentation, the neural network degrades more gracefully in the face of mis-calibrated sensor data. Numerical results for the robustness test are shown in Table 1. VINet trained using no augmentation performs competitively compared to OK-VIS and does not fail with high calibration errors. The results for VINet trained using calibration augmentation show a significantly smaller decrease in accuracy as the calibration errors increase. This indicates that simply by training the network using mis-calibrated data, it can be made robust to mis-calibration errors. This property is unique to the network-based approach and rather surprising, as it is very difficult, if not impossible, to increase the robustness of traditional approaches in this manner.

Table 1: Robustness of VINet to sensor-camera calibration errors.

                      0°        5°        10°       15°
    VINet (no aug.)   0.1751    0.8023    1.94      3.0671
    VINet (w/ aug.)   0.1842    0.1951    0.2218    0.5178
    OK-VIS            0.1644    0.7916    1.9138    FAILS
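The robustness experiments above hinge on injecting a synthetic extrinsic error ΔR_{SC} into the camera-IMU rotation. As a rough illustration of how such a perturbation can be generated, the sketch below applies a rotation of a chosen magnitude about a uniformly random axis; this is our own simplification and stands in for the von Mises-Fisher sampling mentioned above.

```python
import numpy as np

def perturb_extrinsic(R_SC, magnitude_deg, rng=None):
    """Rotate the camera-IMU extrinsic R_SC by `magnitude_deg` degrees about a
    random axis, simulating sensor-camera mis-calibration for augmentation."""
    rng = rng or np.random.default_rng()
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)                       # uniformly random rotation axis
    theta = np.deg2rad(magnitude_deg)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    delta_R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K   # Rodrigues
    return delta_R @ R_SC

# e.g. building a 5-degree mis-calibrated copy of the training extrinsics:
# R_SC_err = perturb_extrinsic(R_SC, 5.0)
```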
Time synchronization is another important calibration aspect which severely affects the performance of traditional methods. We tested VINet in this regard and found that it copes with time-synchronization error even better than with extrinsic calibration error. When the streams are entirely unsynchronized, the IMU data are ignored and the network resorts to vision-only motion estimation.

The training performance in Fig. 6 shows the difference between training solely on the F-2-F displacements, solely on the full SE(3) pose, and using our joint training method. The results show that joint training allows the network to converge more quickly towards low-error estimates over the training and validation sequences, while the F-2-F training converges very slowly and training on the full pose converges to a high-error estimate.

Figure 6: Training performance of the network when training (1) only on the frame-to-frame transformations, (2) jointly on the SE(3) layer and frame-to-frame transformations, and (3) only on the concatenated pose. The training progress on KITTI Seq-00 is shown for the best joint training.

Autonomous driving: Structured Outdoor Trajectory

We further test the performance of VINet using the KITTI odometry benchmark (Geiger, Lenz, and Urtasun 2012; Geiger et al. 2013). The KITTI dataset comprises 11 sequences collected from atop a passenger vehicle driving around a residential area, with accurate ground truth obtained from a Velodyne laser scanner and GPS unit. We use sequences 1-10 for training and 11 for testing. The monocular images and ground truth are sampled at 10 Hz, while the IMU data is recorded at 100 Hz. Being recorded outdoors on structured roads, this dataset exhibits negligible motion blur and the trajectories follow very regular paths. However, it has different characteristics compared to the indoor EuRoC dataset which still make it very challenging. For example, the vehicle path contains many sharp turns and cluttered foliage areas which make data association between frames difficult for traditional visual odometry approaches. As the KITTI dataset does not provide tight synchronization between the camera and IMU, we were unable to successfully run OK-VIS (Leutenegger et al. 2015) on this dataset. For comparison, we instead mimic the system of (Weiss et al. 2012) by fusing pose estimates from LIBVISO2 (Geiger et al. 2013) with the inertial data using an EKF. We follow the standard KITTI evaluation metrics, calculating the error over 100 m, 200 m, ..., 500 m sub-sequences.

Figure 7: Translation and orientation errors on the KITTI dataset. Method A: VNet (only image data), Method B: VINet (visual-inertial data), Method C: Viso2.

Fig. 7 shows the translation and orientation errors obtained on the KITTI dataset. As expected, the visual-inertial network (VINet) outperforms the network using visual data alone. VINet also outperforms the VISO2 method in terms of translational error; however, it suffers somewhat in estimating orientation, where the IMU-Viso2 approach performs better. The high translational accuracy of VINet compared to its orientation estimation can possibly be attributed to its ability to learn to predict scale from both the image data and the IMU data, which is not possible in traditional approaches.

Conclusion and Future Work

In this paper we have presented VINet, an end-to-end trainable system for performing monocular visual-inertial aided navigation. We have shown that VINet performs on par with traditional approaches which require much hand-tuning during setup. Compared to traditional methods, VINet has the key advantage of being able to learn to become robust to calibration errors. We believe that the VINet approach is a first step towards truly robust visual-inertial sensor fusion. For future work we intend to investigate the integration of VINet into a larger system with loop closures and map building, as well as to perform a more in-depth analysis of the monocular VO and its ability to deal with the scale problem in the absence of the inertial data.
References

Bergstra, J.; Breuleux, O.; Bastien, F.; Lamblin, P.; Pascanu, R.; Desjardins, G.; Turian, J.; Warde-Farley, D.; and Bengio, Y. 2010. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf., 1-7.

Bottou, L., and Bousquet, O. 2008. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20, 161-168.

Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M. W.; and Siegwart, R. 2016. The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research.

Castle, R. O.; Klein, G.; and Murray, D. W. 2010. Combining monoSLAM with object recognition for scene augmentation using a wearable camera. Image and Vision Computing 28(11):1548-1556.

Costante, G.; Mancini, M.; Valigi, P.; and Ciarfuglia, T. A. 2016. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robotics and Automation Letters 1(1):18-25.

Cummins, M., and Newman, P. 2008. FAB-MAP: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research 27(6):647-665.

DeTone, D.; Malisiewicz, T.; and Rabinovich, A. 2016. Deep image homography estimation. arXiv preprint arXiv:1606.03798.

Fischer, P.; Dosovitskiy, A.; Ilg, E.; Häusser, P.; Hazırbaş, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; and Brox, T. 2015. FlowNet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852.

Forster, C.; Carlone, L.; Dellaert, F.; and Scaramuzza, D. 2015. IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. In Robotics: Science and Systems XI.

Forster, C.; Pizzoli, M.; and Scaramuzza, D. 2014. SVO: Fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), 15-22.

Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research.

Geiger, A.; Lenz, P.; and Urtasun, R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3354-3361.

Gui, J.; Gu, D.; Wang, S.; and Hu, H. 2015. A review of visual inertial odometry from filtering and optimisation perspectives. Advanced Robotics 29(20):1289-1301.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.

Kaess, M.; Johannsson, H.; Roberts, R.; Ila, V.; Leonard, J. J.; and Dellaert, F. 2011. iSAM2: Incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research.

Konda, K., and Memisevic, R. 2015. Learning visual odometry with a convolutional network. In International Conference on Computer Vision Theory and Applications.

Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; and Furgale, P. 2015. Keyframe-based visual-inertial odometry using nonlinear optimization. The International Journal of Robotics Research 34(3):314-334.

Mourikis, A. I., and Roumeliotis, S. I. 2007. A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proceedings 2007 IEEE International Conference on Robotics and Automation, 3565-3572.

Pillai, S., and Leonard, J. 2015. Monocular SLAM supported object recognition. arXiv preprint arXiv:1506.01732.

Rambach, J. R.; Tewari, A.; Pagani, A.; and Stricker, D. 2016. Learning to fuse: A deep learning approach to visual-inertial camera pose estimation. 71-76.

Salas-Moreno, R. F.; Newcombe, R. A.; Strasdat, H.; Kelly, P. H.; and Davison, A. J. 2013. SLAM++: Simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1352-1359.

Sibley, G.; Matthies, L.; and Sukhatme, G. 2008. A sliding window filter for incremental SLAM. In Unifying Perspectives in Computational and Robot Vision, 103-112. Springer.

Weiss, S.; Achtelik, M. W.; Lynen, S.; Chli, M.; and Siegwart, R. 2012. Real-time onboard visual-inertial state estimation and self-calibration of MAVs in unknown environments. In 2012 IEEE International Conference on Robotics and Automation (ICRA), 957-964.

Zaremba, W. 2015. An empirical exploration of recurrent network architectures. Journal of Machine Learning Research.
