Transferring Rich Feature Hierarchies for Robust Visual Tracking

Naiyan Wang†, Siyi Li†, Abhinav Gupta‡, Dit-Yan Yeung†
† Hong Kong University of Science and Technology  ‡ Carnegie Mellon University
[email protected], [email protected], [email protected], [email protected]

arXiv:1501.04587v2 [cs.CV] 23 Apr 2015

Abstract

Convolutional neural network (CNN) models have demonstrated great success in various computer vision tasks including image classification and object detection. However, some equally important tasks such as visual tracking remain relatively unexplored. We believe that a major hurdle that hinders the application of CNN to visual tracking is the lack of properly labeled training data. While existing applications that liberate the power of CNN often need an enormous amount of training data in the order of millions, visual tracking applications typically have only one labeled example in the first frame of each video. We address this research issue here by pre-training a CNN offline and then transferring the rich feature hierarchies learned to online tracking. The CNN is also fine-tuned during online tracking to adapt to the appearance of the tracked target specified in the first video frame. To fit the characteristics of object tracking, we first pre-train the CNN to recognize what is an object, and then propose to generate a probability map instead of producing a simple class label. Using two challenging open benchmarks for performance evaluation, our proposed tracker has demonstrated substantial improvement over other state-of-the-art trackers.

[Figure 1: Tracking results for the motocross1 and skiing video sequences, comparing PixelTracker, TGPR, and SO-DLT (our proposed tracker).]

1. Introduction

The past few years have been very exciting in the history of computer vision. A great deal of excitement has been generated by applying the biologically-inspired convolutional neural network (CNN) models to some challenging computer vision tasks. For example, breakthrough performance has been reported for image classification [20] and object detection [10]. However, some other computer vision tasks such as visual tracking remain relatively unexplored in this recent surge of research interest. We believe that a major reason is the lack of sufficient labeled training data, which usually plays a very important role in achieving breakthrough performance for other applications because CNN training is typically done in a fully supervised manner. In the case of visual tracking, however, labeled training data is usually very limited, often with only one labeled example as the object to track specified in the first frame of each video. This makes direct application of the large-scale CNN approach infeasible. In this paper, we present an approach which addresses this challenge and hence brings the CNN framework to visual tracking. Using this approach to implement a tracker, we achieve very promising performance which outperforms the best state-of-the-art baseline tracker by more than 10% (see Fig. 1 for some qualitative tracking results).

Although visual tracking can be formulated in different settings according to different applications, the focus of this paper is the one-pass model-free single object tracking setting. Specifically, it assumes that the bounding box of one single object in the first frame is given but no other appearance model is available. Given this single (labeled) instance, the goal is to track the movement of the object in an online manner. Consequently, this setting involves adapting the tracker to appearance changes of the object based on the possibly noisy output of the tracker.
Another way to formulate this problem would be as a self-taught one-shot learning problem in which the single example comes from the previous frame. Since learning a visual model from a single example is an ill-posed problem, a successful approach requires using some auxiliary data to learn an invariant representation of generic object features. Although some recent work [31, 33] also shares this spirit, the performance reported is inferior to the state of the art, due to the lack of sufficient training data on one hand and the limited representational power of the model used on the other. CNN has a role to play here by learning more robust features. To make this feasible with the limited training data available during online tracking, we pre-train a CNN offline and then transfer the generic features learned to the online tracking task.

The first deep learning tracker (DLT) [31] reported in the literature is based on a stacked denoising autoencoder network. While this approach is very promising, the exact realization reported in the paper has two limitations that hinder the tracking performance of DLT as compared to other state-of-the-art trackers. First, the pre-training of DLT may not be very suitable for tracking applications. The data used for pre-training comes from the 80M Tiny Images dataset [29], with each image obtained by downsampling directly from a full-sized image. Although some generic image features can be learned by learning to reconstruct the input images, the target to track in a typical tracking task is a single object rather than an entire image. Features that are effective for tracking should be able to distinguish objects from non-objects (i.e., background), not just reconstruct an entire image. Second, in each frame, DLT first generates candidates or proposals of the target based on the predictions of the previous frames, and then treats tracking as a classification problem. It ignores the structured nature of bounding boxes, in that a bounding box or segmentation result corresponds to a region of an image, not just a simple label or real number as in a classification or regression problem. Some previous work [14, 32] showed that exploiting this structured nature explicitly in the model could improve the performance significantly. Moreover, the number of proposals is usually in the order of several hundreds, making it hard to apply larger and more powerful deep learning models.

We propose a novel structured output CNN which transfers generic object features for online tracking. The contributions of our paper are summarized as follows:

1. To alleviate the overfitting and drifting problems during online tracking, we pre-train the CNN to distinguish objects from non-objects, instead of simply reconstructing the input or performing categorical classification on large-scale datasets with object-level annotations [7].

2. The output of the CNN is a pixel-wise map indicating the probability that each pixel in the input image belongs to the bounding box of an object. The key advantages of the pixel-wise output are its induced structured loss and its computational scalability.

3. We evaluate our proposed method on an open benchmark [34] as well as a challenging non-rigid object tracking dataset and obtain very remarkable results. In particular, we improve the area under curve (AUC) metric of the overlap rate curve from 0.529 to 0.602 on the open benchmark.
2. Related Work

2.1. Deep Learning and CNNs

The root of deep learning can be dated back to research on multilayered neural networks in the late 1980s. The resurgence of research interest in neural networks owes to a more recent work [17] which used pre-training to make the training of deeper networks feasible. Among different deep learning models, CNN seems to be a more suitable choice for many vision tasks, as the design of CNN has been inspired by biological vision systems. Among its characteristics, the convolution operation can capture local and repetitive similarity, and the pooling operation allows local translational invariance in images. The rapid development of powerful computing devices such as general-purpose graphics processing units (GPGPU) and the availability of large-scale labeled datasets such as ImageNet [7] have made the training of large-scale CNNs possible. It has been demonstrated visually in [35] that a CNN can gradually learn low-level to high-level features through the transformation and enlargement of receptive fields in different layers. As opposed to using hand-crafted features as in the conventional recognition pipeline, it has been demonstrated that the features learned by a large-scale CNN can achieve very superior performance in high-level vision tasks such as image classification [20] and object detection [10].

2.2. Visual Tracking

Many methods have been proposed for single object tracking. For a systematic review and comparison, we refer the readers to a recent survey and a benchmark [27, 34]. Most of the existing tracking methods belong to the general framework of Bayesian tracking [1], which decomposes the problem into two parts involving a motion model and an appearance model. Although some trackers attempt to go beyond this framework, e.g. [21, 22, 23], most still focus on improving the appearance model because this aspect is crucial to enhancing performance.

Generally speaking, most trackers belong to one of two categories: generative trackers and discriminative trackers. Generative trackers usually assume a generative process of the tracked target and search for the most probable candidate as the tracking result. Some representative methods are based on principal component analysis [26], sparse coding [25], and dictionary learning [30]. On the other hand, discriminative trackers learn to separate the foreground from the background using a classifier. Many advanced machine learning algorithms have been used, including boosting variants [12, 13], multiple-instance learning [2], structured output SVM [14], and Gaussian process regression [9]. These two approaches are in general complementary. Discriminative trackers are usually more resistant to cluttered background since they explicitly sample image patches from the background as negative training examples. On the other hand, generative trackers are usually more accurate under normal situations.
Besides, some methods exploit correlation filters for the target or its context [4, 16, 37]. Their primary advantage is that only the fast Fourier transform and several matrix operations are needed, making them very suitable for real-time applications. Moreover, some methods take the ensemble learning approach [3, 36, 32], which is especially effective when the constituent trackers in the ensemble have high diversity. Furthermore, some methods focus on long-term tracking, e.g. [19, 28].

As for applying deep learning to visual tracking, besides the DLT [31] mentioned in the previous section, some recent methods include using an ensemble [39] and maintaining a pool of CNNs [24]. However, due to the lack of sufficient training data, these methods only show comparable or even inferior results compared to other state-of-the-art trackers. In summary, for visual tracking applications, we believe that the power of deep learning has not yet been fully liberated.

3. Our Tracker

In this section, we present our structured output deep learning tracker (SO-DLT). We first present the CNN architecture in SO-DLT and the offline pre-training process of the CNN. We then present details of the online tracking process.

3.1. Overview

Training of the tracker can be divided into two stages: the offline pre-training stage and the online fine-tuning and tracking stage. In the pre-training stage, we train a CNN to learn generic object features for distinguishing objects from non-objects, i.e., to learn from examples the notion of objectness. Instead of fixing the learned parameters of the CNN during online tracking, we fine-tune them so that the CNN can adapt to the target being tracked. For robustness, we run two CNNs concurrently during online tracking to account for possible mistakes caused by model update. The two CNNs work collaboratively in determining the tracking result of each video frame.

3.2. Objectness Pre-training

The architecture of the structured output CNN is shown in Fig. 2. It consists of seven convolutional layers and three fully connected layers. Between these two parts, a multi-scale pooling scheme [15] is introduced to retain more features related to locality, since the output needs them for localization. The parameter setting of the network is shown in Fig. 2. In contrast to the conventional CNN used for classification or regression, there is a crucial difference in our model: the output of the CNN is a 50×50 probability map rather than a single number. Each output pixel corresponds to a 2×2 region of the original input, with its value representing the probability that the corresponding input region belongs to an object. In our implementation, the output layer is a 2500-dimensional fully connected layer which is then reshaped to the 50×50 probability map. Since there exists strong correlation between neighboring pixels of the probability map, we use only 512 hidden units in the previous layer to help prevent overfitting.

To train such a large CNN, it is essential to use a large dataset to prevent overfitting. Since we are interested in object-level features, we use the ImageNet 2014 detection dataset¹, which contains 478,807 bounding boxes in the training set. For each annotated bounding box, we add random padding and scaling around it. We also randomly sample some negative examples whose overlap rates² with the positive examples are below a certain threshold. Note that the CNN does not learn to distinguish different object classes as in a typical classification or detection task, since in this stage we are only interested in learning to differentiate objects from non-objects. Consequently, we use an element-wise logistic regression model at each position of the 50×50 output map and define the loss function accordingly. For the training target, a pixel is set to 1 inside the bounding box and 0 outside; for a negative example, the target is 0 over the entire probability map. This setting is equivalent to penalizing the number of mismatched pixels between the prediction and the ground truth, thus inducing a structured loss function which fits the problem better. Mathematically, let p_ij denote the prediction at position (i,j) and t_ij be a binary variable denoting the ground truth at position (i,j); the loss function of our method is defined as

    min \sum_{i=1}^{50} \sum_{j=1}^{50} [ -(1 - t_{ij}) \log(1 - p_{ij}) - t_{ij} \log(p_{ij}) ].    (1)

The detailed parameters for training are described in Sec. 4.1.

¹ http://image-net.org/challenges/LSVRC/2014/
² The overlap rate between two bounding boxes is defined as the area of the intersection of the two bounding boxes over the area of their union.
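To make the training targets and the structured loss of Eq. (1) concrete, here is a minimal NumPy sketch. This is our own illustration rather than the authors' Caffe implementation; the array shapes follow the 50×50 output map described above, and the clipping constant is an assumption added for numerical safety.

```python
import numpy as np

def make_target(box, size=50):
    """Build the training target for a positive example: t_ij = 1 inside
    the ground-truth box, 0 outside (all zeros for a negative example).
    box is (x, y, w, h) in output-map coordinates, where each output
    pixel covers a 2x2 region of the input."""
    x, y, w, h = box
    t = np.zeros((size, size))
    t[y:y + h, x:x + w] = 1.0
    return t

def objectness_loss(p, t, eps=1e-12):
    """Element-wise logistic loss of Eq. (1) summed over the 50x50 map.
    p holds predicted probabilities in (0, 1); t is the binary target."""
    p = np.clip(p, eps, 1.0 - eps)  # assumed guard against log(0)
    return np.sum(-(1.0 - t) * np.log(1.0 - p) - t * np.log(p))
```

Because every pixel of the map contributes to the loss, a prediction that misplaces the box is penalized in proportion to the number of mismatched pixels, which is exactly the structured behavior described above.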
[Figure 2: Architecture of the proposed structured output CNN.]

Fig. 3 shows some results when the pre-trained CNN is tested on the held-out validation set provided by the ImageNet 2014 detection task. In most cases, the CNN can successfully determine whether the input image contains an object, and if so, it can accurately locate the object of interest. Note that since the labels of our training data are only bounding boxes, the output of the 50×50 probability map is also in the form of a square. Although there are methods [6] that utilize bounding box information to provide weak supervision and obtain pixel-wise segmentation as well, we believe that the probability map output of our model is sufficient for the purpose of tracking.

[Figure 3: Testing of the pre-trained objectness CNN on the ImageNet 2014 detection validation set. The first row shows two positive examples, each containing an object; the objectness CNN accurately detects the object position and scale. The second row shows two negative examples; the objectness CNN does not fire on them, showing the lack of evidence for any object of interest. The CNN plays an important role in making SO-DLT robust to occlusion and cluttered background during online tracking.]

3.3. Online Tracking

The CNN pre-trained to learn generic object features as described above cannot be used directly for online tracking, because the data bias of the ImageNet data is different from that of the data observed during online tracking. Moreover, if we do not fine-tune the CNN, it will fire on all objects that appear in a video frame instead of just the object being tracked. Therefore, it is essential to fine-tune the pre-trained CNN using the annotation in the first frame of each video to make sure that the CNN is specific to the target. Note that fine-tuning, or online model adaptation, is an indispensable part of our tracker rather than an optional feature solely introduced to further improve the tracking performance.

We now present the basic online tracking pipeline. We maintain two CNNs which use different model update strategies. After fine-tuning using the annotation in the first frame, we crop some image patches from each new frame based on the estimation of the previous frame. By making a simple forward pass through the CNN, we obtain the probability map for each of the image patches. The final estimation is then determined by searching for a proper bounding box, and the two CNNs are updated if necessary. We illustrate the pipeline of the tracking algorithm in Fig. 4 and elaborate the major steps separately in what follows.

[Figure 4: Pipeline of our tracking algorithm.]

3.3.1. Bounding Box Determination

When a new frame arrives, the first step of our tracker is to determine the best location and scale of the target. We first specify the possible regions that may contain the target and feed these regions into the CNN. Next, we decide the most probable location of the bounding box based on the probability map.

Search Mechanism: Selecting a proper search range for the target is a nontrivial problem. Using too small a search region makes it easy to lose track of the target under fast motion, but using too large a search region may include salient distractors from the background. For example, in Fig. 5, the output response gets weaker as the search region is enlarged, mainly due to the cluttered background and another person nearby. To address this issue, we propose a multi-scale search scheme for determining the proper bounding box. First, all the cropped regions are centered at the estimation from the previous frame. Then, we start searching with the smallest scale. If the sum over the output probability map is below a threshold (i.e., the target may not be at this scale), we proceed to the next larger scale. If we cannot find the object at any scale, we report that the target is missing.
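The escalating search can be summarized by the sketch below. It is a simplified paraphrase, not the released implementation: the helper names are ours, the frame is assumed to be a NumPy array, and the confidence threshold corresponds to the sum test described above.

```python
def crop_centered(frame, center, side):
    """Crop a side x side square around center = (cx, cy), clipped to the frame."""
    cx, cy = center
    h, w = frame.shape[:2]
    x0, y0 = max(0, cx - side // 2), max(0, cy - side // 2)
    return frame[y0:min(h, y0 + side), x0:min(w, x0 + side)]

def multi_scale_search(frame, center, scales, cnn_forward, sum_threshold):
    """Try the smallest crop first and enlarge the search region only when
    the summed objectness response is too weak at the current scale.
    cnn_forward maps an image crop to a 50x50 probability map."""
    for side in scales:                         # ordered smallest to largest
        prob_map = cnn_forward(crop_centered(frame, center, side))
        if prob_map.sum() > sum_threshold:      # target likely present here
            return side, prob_map
    return None                                 # target missing at all scales
```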
Generating Bounding Box: After we have selected the best scale, we need to generate the final bounding box for the current frame. We first determine the center of the bounding box and then estimate its scale change with respect to the previous frame. To determine the center, we use a density-based method which sets a threshold τ1 on the corresponding probability map and finds a bounding box whose inside probability values are all above the threshold; the bounding box location under the current scale is then estimated by averaging over different values of τ1. After the center is determined, we search again in the corresponding region to find a proper scale. The scale should fit the exact target region as closely as possible. Simply using the average confidence (which makes the tracker prefer the central area with high confidence) or the total confidence (which makes it prefer the whole frame) is not satisfactory. Let P denote the output probability map and p_ij the (i,j)-th element of P, and consider a bounding box with top-left corner (x,y), width w, and height h. Its score is calculated as

    c = \sum_{i=x}^{x+w-1} \sum_{j=y}^{y+h-1} p_{ij} - \epsilon \cdot w \cdot h,    (2)

where ε balances the scale of the bounding box. We also repeat this with several values of ε and average the results for robust estimation. The confidence can be calculated very efficiently with the help of integral images [5].
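Since the score of Eq. (2) is a box sum minus an area penalty, every candidate box can be evaluated in constant time from a summed-area table [5] built once per probability map. A small sketch under that reading of Eq. (2) follows (function names are ours):

```python
import numpy as np

def integral_image(P):
    """Summed-area table with a zero border: S[r, c] = sum of P[:r, :c]."""
    S = np.zeros((P.shape[0] + 1, P.shape[1] + 1))
    S[1:, 1:] = P.cumsum(axis=0).cumsum(axis=1)
    return S

def box_score(S, x, y, w, h, eps):
    """Eq. (2): sum of p_ij over rows y..y+h-1 and columns x..x+w-1,
    minus eps * w * h, using four lookups into the summed-area table."""
    box_sum = S[y + h, x + w] - S[y, x + w] - S[y + h, x] + S[y, x]
    return box_sum - eps * w * h
```

With this, scanning many (x, y, w, h) candidates and averaging over several ε values, as done above, stays cheap.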
3.3.2. Differentially-paced Fine-tuning

Model update in visual tracking often faces a dilemma: if the tracker does not update frequently enough, it may not adapt well to appearance changes; but if it updates too frequently, inaccurate results may impair its performance and lead to the drifting problem.

We tackle this dilemma by using two CNNs during online tracking. The basic idea is to make one CNN (CNN_S) account for short-term appearance while the other (CNN_L) accounts for long-term appearance. First, both CNNs are fine-tuned on the first frame of a video. Afterwards, CNN_L is tuned conservatively while CNN_S is tuned aggressively. By working collaboratively, CNN_S adapts to dramatic appearance changes while CNN_L stays resistant to potential mistakes. The final estimation is then determined by the more confident of the two. Consequently, the integrated result is more robust to drifting caused by occlusion or a cluttered background. Although more advanced model update methods exist, e.g. [36], we find that this simple scheme works quite well in practice.

We now provide more details about the update strategies. We first observe that if a model is updated immediately whenever the prediction falls below a threshold, it is easily influenced by noisy results. On the other hand, we find that the quality of negative examples is generally quite stable. As a result, CNN_S is updated when there exists a negative example such that

    \sum_{i=1}^{50} \sum_{j=1}^{50} p_{ij} > \tau_2.    (3)

This makes sure that any background object that makes the CNN fire is suppressed, reducing the chance that the tracker drifts towards negative examples similar to the tracked object when the following frames are processed. On the contrary, in addition to the above condition, CNN_L is only updated if

    \sum_{i=x}^{x+w-1} \sum_{j=y}^{y+h-1} p_{ij} > \tau_3 \cdot w \cdot h,    (4)

where (x, y, w, h) denotes the output target bounding box in the current frame. This means that we update CNN_L more conservatively, in that we only update it when we are highly confident about the result in the current frame. Doing so reduces the risk of an incorrect update when the true target has already drifted to the background.
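Both update tests reduce to threshold checks on sums of the probability map. The sketch below is our paraphrase of Eqs. (3) and (4); the variable names are ours, and the default thresholds are the values given later in Sec. 4.1.

```python
def should_update_cnn_s(negative_maps, tau2=100.0):
    """Eq. (3): update the short-term CNN_S whenever some negative example
    still draws a strong response, so that background objects that make
    the CNN fire get suppressed."""
    return any(m.sum() > tau2 for m in negative_maps)

def should_update_cnn_l(negative_maps, target_map, box, tau2=100.0, tau3=0.8):
    """CNN_L update: Eq. (3) must hold AND, per Eq. (4), the summed
    confidence inside the estimated box (x, y, w, h) must exceed
    tau3 * w * h, i.e. the long-term model is only touched when we are
    highly confident about the current frame."""
    x, y, w, h = box
    inside = target_map[y:y + h, x:x + w].sum()
    return should_update_cnn_s(negative_maps, tau2) and inside > tau3 * w * h
```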
In each update, we need to collect both positive and negative examples. Our sampling scheme is illustrated in Fig. 5. For positive examples, we sample them at four scales based on the estimation of the previous frame; random translation is also introduced to eliminate the learning bias towards the center location. For negative examples, we crop eight non-overlapping bounding boxes around the target in different directions at two scales. The output of the CNN for the positive examples is also shown in Fig. 5.

[Figure 5: Sampling scheme of the proposed tracker. (a) The red bounding box denotes the target to track, while the eight blue ones around it are negative examples. (b) The upper part shows positive examples fed into the CNN, padded with different scales and random translations; the lower part shows the corresponding output of the CNN after fine-tuning on this frame.]

4. Experiments

In this section, we empirically validate the proposed SO-DLT tracker by comparing it with other state-of-the-art trackers. For fair comparison, not only do we need a reasonably large benchmark dataset to avoid bias due to data selection, but there should also be a carefully designed protocol that is followed by every tracker. A recent work [34] introduced a unified tracking benchmark which includes both a dataset and a protocol. We use the benchmark dataset for our comparative study and strictly follow the protocol by fixing the same set of parameters for all video sequences tested. We will make our implementation publicly available if the paper is accepted.

4.1. Implementation Details

The part related to the CNN is implemented using the Caffe toolbox³ and the wrapper for online tracking is implemented directly in MATLAB. All the experiments are run on a desktop computer with a 3.40GHz CPU and a K40 GPU. The speed of our unoptimized code is about 4 to 5 frames per second.

For the pre-training of the CNN, we start with a learning rate of 10^-7 with momentum 0.9 and decrease the learning rate once every 5 epochs. We train for about 15 epochs in total. Note that our learning rate is much smaller than typical choices due to the different loss function we use. To alleviate overfitting, a weight decay of 5×10^-4 is used for each layer, and the first fully connected layer is regularized with a dropout rate of 0.5. During fine-tuning, we use a larger learning rate of 2×10^-7 with a smaller momentum of 0.5. For the first frame, we fine-tune each CNN for 20 iterations; for subsequent frames, we only fine-tune for one iteration. τ1 ranges from 0.1 to 0.7 with a step size of 0.05. The threshold τ2 on the sum of confidences for negative examples is set to 100. The updating threshold τ3 of CNN_L is set to 0.8. The normalization constant ε for searching proper scales ranges from 0.55 to 0.6 with a step size of 0.025.

³ http://caffe.berkeleyvision.org

4.2. CVPR2013 Visual Tracker Benchmark

The CVPR2013 Visual Tracker Benchmark [34] contains 50 fully annotated sequences and covers a variety of challenging scenarios from the tracking literature of the past few years.

4.2.1. Evaluation Setting and Metrics

We use two performance measures analogous to the area under curve (AUC) measure for the receiver operating characteristic (ROC) curve. Specifically, for a given overlap threshold in [0,1], a tracker is considered successful in a frame if its overlap rate exceeds the threshold. The success rate for a video measures the percentage of successful frames over the entire video. By varying the threshold gradually from 0 to 1, this gives a plot of the success rate against the overlap threshold for each tracker. A similar performance measure, called the precision plot, is defined for the central pixel error, which measures the distance in pixels between the centers of the bounding boxes for the ground truth and the prediction. The difference is that the precision at threshold 20 is used for the performance score instead of the AUC score as in the success plots. The results of 29 trackers on the benchmark can be found in [34]. For a more complete comparison, we also include a recent tracker called TGPR [9]. As far as we know, TGPR is the best single tracker with code publicly available.
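For reference, both measures can be computed from the per-frame bounding boxes as in the sketch below (our own helper code; the (x, y, w, h) box format is an assumption):

```python
import numpy as np

def overlap_rate(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(preds, gts, overlap_threshold):
    """Fraction of frames whose overlap rate exceeds the threshold;
    sweeping the threshold over [0, 1] traces out the success plot."""
    return np.mean([overlap_rate(p, g) > overlap_threshold
                    for p, g in zip(preds, gts)])

def precision_at(preds, gts, pixel_threshold=20.0):
    """Fraction of frames whose central pixel error (distance between
    box centers) falls below the threshold; threshold 20 gives the
    reported precision score."""
    errs = [np.hypot((p[0] + p[2] / 2.0) - (g[0] + g[2] / 2.0),
                     (p[1] + p[3] / 2.0) - (g[1] + g[3] / 2.0))
            for p, g in zip(preds, gts)]
    return np.mean(np.asarray(errs) < pixel_threshold)
```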
4.2.2. Quantitative Results

Due to space limitations, we only show in each plot the overall performance under one-pass evaluation (OPE) for our proposed tracker and some of the state-of-the-art trackers, namely TGPR [9] and the top five trackers in the CVPR2013 benchmark [34]: Struck [14], SCM [38], TLD [19], ASLA [18], and DLT [31]. The success and precision plots are shown in Fig. 6. For each tracker, a curve is obtained by averaging over those for all 50 test sequences.

[Figure 6: Plots of OPE on the CVPR2013 Visual Tracker Benchmark. The performance score for each tracker is shown in the legend. For success plots the score is the AUC value, while for precision plots the score is the precision value at threshold 20.]

From Fig. 6, we can see that SO-DLT outperforms the other trackers by a large margin. Specifically, SO-DLT outperforms the second best tracker, TGPR, by 13.8% on the success plots and by 6.9% on the precision plots. As for the other top-ranking trackers in the benchmark, the improvement of SO-DLT is even more significant.

We also report in Tab. 1 and Tab. 2 the average success rates at several thresholds for the different methods. SO-DLT consistently outperforms the other trackers by a large margin. Consider, for example, the case when the overlap rate is 0.5, which is a common choice for computing the success rate: SO-DLT outperforms the nearest baseline by 20.6%. We believe the comparison is substantial enough to demonstrate the superiority of SO-DLT.

            [email protected]   [email protected]   [email protected]
  SO-DLT    0.8576   0.7798   0.4732
  TGPR      0.7690   0.6463   0.3767
  SCM       0.6807   0.6162   0.4396
  Struck    0.6694   0.5593   0.3543
  TLD       0.6362   0.5210   0.2997
  DLT       0.5912   0.5066   0.3581
  ASLA      0.5730   0.5112   0.3884

Table 1. Success rates at different thresholds based on the overlap rate metric for different tracking methods.

            p@15     p@25     p@35
  SO-DLT    0.7721   0.8470   0.8772
  TGPR      0.7133   0.7907   0.8215
  Struck    0.6047   0.6895   0.7257
  SCM       0.6171   0.6703   0.6980
  TLD       0.5520   0.6403   0.6848
  DLT       0.5400   0.6128   0.6541
  ASLA      0.5050   0.5554   0.5915

Table 2. Success rates at different thresholds based on the central-pixel error metric for different tracking methods.

For a better analysis of the strengths and weaknesses of each tracker, each of the 50 sequences is also annotated with attributes that reflect its challenging factors. Fig. 7 shows the ranking scores of the leading trackers on different groups of sequences, where each group corresponds to a different attribute. From Fig. 7, we can see that SO-DLT substantially outperforms the other state-of-the-art trackers under almost all conditions. More precisely, our proposed tracker gets the highest score for all 11 attributes. Besides, we can also see that SO-DLT is especially good at handling attribute groups such as "illumination variation (IV)", "out-of-plane rotation (OPR)", "in-plane rotation (IPR)", "out-of-view (OV)", and "scale variation (SV)". The sequences in these groups often suffer from large appearance changes. This promising result demonstrates the great representational power of CNN in detecting the target object even in some extreme cases. Moreover, when model drifting happens, the proposed differentially-paced fine-tuning can also help correct the drifting problem as soon as the occlusion disappears.

[Figure 7: Average performance ranking scores of five leading trackers on different subsets of test sequences in OPE. Each subset corresponds to one of the attributes: illumination variation (IV), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-view (OV), background cluttered (BC), and low resolution (LR). The number after each attribute name is the number of sequences in the corresponding subset. (a) Ranking scores based on success plots; (b) ranking scores based on precision plots. The trackers included here are selected based on their overall performance ranking scores in OPE.]

Due to space limitations, we only include the representative results above and leave more details to the supplementary material.

4.3. Non-rigid Object Tracking Dataset

To gain a deeper insight into the proposed SO-DLT, we conduct an additional analysis on a challenging non-rigid object tracking dataset proposed in [11]. In this experiment, we only compare with TGPR [9], the best tracker among all the trackers compared on the previous benchmark, and PixelTracker [8], a state-of-the-art non-rigid object tracker. Since the target is highly deformable, it is not meaningful to use the overlap rate defined by bounding boxes as a performance measure, so we only use the central pixel error as our evaluation metric. The results are shown in Tab. 3.
propriatedatafromthefirstframeisavailableforpre- 4.4.Discussion trainingandinitialization. Althoughourtrackerdemonstratessuperiorperformance We will investigate these and possibly other ideas in our on the two benchmarks, some failure cases do exist and a futurework. detailed analysis of them is important for making further improvement. Herearesomesuchcases: 5.Conclusion 1. The tracker is likely to drift when there exist distrac- In this paper, we have exploited the effectiveness of torsinthebackgroundandwhenthetargetisoccluded. transferring high-level feature hierarchies for visual track- Duetothestrong, invariantrepresentationalpowerof ing. Tothebestofourknowledge,wearethefirsttobring CNN, the distractors with similar appearance may be large-scaleCNNtotheareaofvisualtrackingandshowsig- incorrectly recognized as the target, leading to incor- nificantimprovementoverthestate-of-the-arttrackers. In- recttrackingresults. steadofmodelingtrackingasaproposalclassificationprob- lem,wepresentedanovelstructuredoutputCNNforvisual 2. The tracker can possibly drift if the initial bounding tracking. Moreover, instead of learning to reconstruct the box is not properly specified. This case often occurs inputimageasinpreviouswork,theCNNisfirstpre-trained when the shape of the target is quite irregular so that on the large-scale ImageNet detection dataset [7] to learn only a small portion of the object is specified by the tolocalizeobjectssoastoalleviatetheproblemcausedby boundingboxasthetarget. lackoflabeledtrainingdata. ThisobjectnessCNNisthen transferred and fine-tuned during the online tracking pro- To address these challenges, here are some possible solu- cess. Extensive experimentshave validatedthe superiority tions: ofourSO-DLTtracker. References [19] Z.Kalal,K.Mikolajczyk,andJ.Matas. Tracking-learning- detection. IEEETransactionsonPatternAnalysisandMa- [1] M.Arulampalam,S.Maskell,N.Gordon,andT.Clapp. A chineIntelligence,34(7):1409–1422,2012. 3,7 tutorialonparticlefiltersforonlinenonlinear/non-Gaussian [20] A.Krizhevsky,I.Sutskever,andG.Hinton. ImageNetclas- Bayesiantracking.IEEETransactionsonSignalProcessing, sificationwithdeepconvolutionalneuralnetworks.InNIPS, 50(2):174–188,2002. 2 pages1106–1114,2012. 1,2 [2] B. Babenko, M. Yang, and S. Belongie. Robust object [21] J. Kwon and K. Lee. Visual tracking decomposition. In tracking with online multiple instance learning. IEEE CVPR,pages1269–1276,2010. 2 TransactionsonPatternAnalysisandMachineIntelligence, [22] J.KwonandK.M.Lee.Minimumuncertaintygapforrobust 33(8):1619–1632,2011. 3 visualtracking. InCVPR,pages2355–2362,2013. 2 [3] C.Bailer,A.Pagani,andD.Stricker.Asuperiortrackingap- [23] J.KwonandK.M.Lee. Intervaltracker: Trackingbyinter- proach: Buildingastrongtrackerthroughfusion. InECCV, valanalysis. InCVPR,pages3494–3501,2014. 2 pages170–185.2014. 3 [24] H.Li,Y.Li,andF.Porikli.DeepTrack:Learningdiscrimina- [4] D.S.Bolme,J.R.Beveridge,B.A.Draper,andY.M.Lui. tivefeaturerepresentationsbyconvolutionalneuralnetworks Visual object tracking using adaptive correlation filters. In forvisualtracking. InBMVC,2014. 3 CVPR,pages2544–2550,2010. 3 [25] X.MeiandH.Ling. Robustvisualtrackingusingl mini- 1 [5] F.C.Crow. Summed-areatablesfortexturemapping. SIG- mization. InICCV,pages1436–1443,2009. 3 GRAPH,18(3):207–212,1984. 5 [26] D.Ross,J.Lim,R.Lin,andM.Yang. Incrementallearning [6] J. Dai, K. He, and J. Sun. 
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[8] S. Duffner and C. Garcia. PixelTrack: A fast adaptive algorithm for tracking non-rigid objects. In ICCV, pages 2480–2487, 2013.
[9] J. Gao, H. Ling, W. Hu, and J. Xing. Transfer learning based visual tracking with Gaussian processes regression. In ECCV, pages 188–203, 2014.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[11] M. Godec, P. M. Roth, and H. Bischof. Hough-based tracking of non-rigid objects. Computer Vision and Image Understanding, 117(10):1245–1256, 2013.
[12] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In BMVC, pages 47–56, 2006.
[13] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robust tracking. In ECCV, pages 234–247, 2008.
[14] S. Hare, A. Saffari, and P. H. Torr. Struck: Structured output tracking with kernels. In ICCV, pages 263–270, 2011.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361, 2014.
[16] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2014.
[17] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[18] X. Jia, H. Lu, and M. Yang. Visual tracking via adaptive structural local sparse appearance model. In CVPR, pages 1822–1829, 2012.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.
[20] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[21] J. Kwon and K. Lee. Visual tracking decomposition. In CVPR, pages 1269–1276, 2010.
[22] J. Kwon and K. M. Lee. Minimum uncertainty gap for robust visual tracking. In CVPR, pages 2355–2362, 2013.
[23] J. Kwon and K. M. Lee. Interval tracker: Tracking by interval analysis. In CVPR, pages 3494–3501, 2014.
[24] H. Li, Y. Li, and F. Porikli. DeepTrack: Learning discriminative feature representations by convolutional neural networks for visual tracking. In BMVC, 2014.
[25] X. Mei and H. Ling. Robust visual tracking using l1 minimization. In ICCV, pages 1436–1443, 2009.
[26] D. Ross, J. Lim, R. Lin, and M. Yang. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1):125–141, 2008.
[27] A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 2014.
[28] J. S. Supancic III and D. Ramanan. Self-paced learning for long-term tracking. In CVPR, pages 2379–2386, 2013.
[29] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large dataset for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
[30] N. Wang, J. Wang, and D.-Y. Yeung. Online robust non-negative dictionary learning for visual tracking. In ICCV, pages 657–664, 2013.
[31] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In NIPS, pages 809–817, 2013.
[32] N. Wang and D.-Y. Yeung. Ensemble-based tracking: Aggregating crowdsourced structured time series data. In ICML, pages 1107–1115, 2014.
[33] Q. Wang, F. Chen, J. Yang, W. Xu, and M. Yang. Transferring visual prior for online object tracking. IEEE Transactions on Image Processing, 21(7):3296–3305, 2012.
[34] Y. Wu, J. Lim, and M. Yang. Online object tracking: A benchmark. In CVPR, 2013.
[35] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
[36] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, pages 188–203, 2014.
[37] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang. Fast visual tracking via dense spatio-temporal context learning. In ECCV, pages 127–141, 2014.
[38] W. Zhong, H. Lu, and M.-H. Yang. Robust object tracking via sparsity-based collaborative model. In CVPR, pages 1838–1845, 2012.
[39] X. Zhou, L. Xie, P. Zhang, and Y. Zhang. An ensemble of deep neural networks for object tracking. In ICIP, 2014.
