
Tracking The Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies


Amir Sadeghian, Alexandre Alahi, Silvio Savarese
Stanford University
{amirabs, alahi, ssilvio}@cs.stanford.edu

arXiv:1701.01909v2 [cs.CV] 3 Apr 2017

Abstract

The majority of existing solutions to the Multi-Target Tracking (MTT) problem do not combine cues in a coherent end-to-end fashion over a long period of time. In contrast, we present an online method that encodes long-term temporal dependencies across multiple cues. One key challenge of tracking methods is to accurately track occluded targets or those which share similar appearance properties with surrounding objects. To address this challenge, we present a structure of Recurrent Neural Networks (RNN) that jointly reasons on multiple cues over a temporal window. We are able to correct many data association errors and recover observations from an occluded state. We demonstrate the robustness of our data-driven approach by tracking multiple targets using their appearance, motion, and even interactions. Our method outperforms previous works on multiple publicly available datasets including the challenging MOT benchmark.

Figure 1. We present a method based on a structure of RNNs (each RNN is depicted by a trapezoid) that learns to encode long-term temporal dependencies across multiple cues (appearance, motion, and interaction). Our learned representation is used to compute the similarity scores of a "tracking-by-detection" algorithm. [16]

1. Introduction

Architectures based on neural networks have become an essential instrument in solving perception tasks and have been shown to approach human-level accuracy in classifying images [18, 19]. However, the status quo of the Multi-Target Tracking (MTT) problem is still far from matching human performance [61, 75]. This is mainly because it is difficult for neural networks to capture the inter-relation of targets in time and space using multi-modal cues (e.g., appearance, motion, and interactions). In this work, we tackle the MTT problem by jointly learning a representation that takes into account several cues over a time period in an end-to-end fashion (see Figure 1).

The objective of MTT is to infer trajectories of targets as they move around. It covers a wide range of applications such as sports analysis [43, 51, 76], biology (e.g., birds [44], ants [28], fish [66, 67, 15], cells [45, 39]), robot navigation [11, 12], and autonomous driving vehicles [13, 57]. We follow the "tracking-by-detection" paradigm whereby detection outputs are to be connected across video frames. This is often formulated as an optimization problem with respect to a graph [55, 56]. Each detection is represented by a node, and edges encode the similarity scores. Over the past decades, researchers have made significant progress in proposing techniques to solve the optimal assignments of graph-based formulations [86, 1, 34, 64]. However, their MTT performance is limited by the specific design choices of their representation and the corresponding similarity function.

In crowded environments, occlusions, noisy detections (e.g., false alarms, missing detections, inaccurate bounding boxes), and appearance variability are very common. In traditional MTT approaches, representations and similarity functions are hand-crafted in an attempt to capture similar appearance and motion across adjacent temporal frames [34, 64, 75, 77]. In contrast, we propose a method to encode long-term temporal dependencies across multiple cues without the need to hand-specify parameters or weights. Our framework is based on a structure of Recurrent Neural Networks (RNN), which has also shown benefits in other applications [26]. The rest of the paper is organized as follows. In Section 3, we present details on the inputs of each RNN and on learning a representation that can be used to compute a similarity score in an end-to-end fashion. Our appearance model is an RNN constructed on a Convolutional Neural Network (CNN) whose purpose is to classify whether a detection is similar to a target instance at different time frames. Our motion and interaction models leverage two separate Long Short-Term Memory (LSTM) networks that track the motion and interactions of targets over longer periods, which suits scenes with long-term occlusions. We then combine these networks into a structure of RNNs to learn to reason jointly on different cues across time. Our method runs online without the need to see future frames. In Section 4, we present a detailed evaluation of our framework using multiple benchmarks such as the MOT challenge [37, 46] and the Stanford drone dataset [59].

2. Related Work

In recent years, tracking has been successfully extended to scenarios with multiple targets [52, 38, 24, 75]. As opposed to single target tracking approaches, which construct a sophisticated appearance model to track a single target in different frames, multiple target tracking does not mainly focus on the appearance model. Although appearance is an important cue, relying only on appearance can be problematic in MTT scenarios where the scene is highly crowded or when targets may share the same appearance. To this end, some works have been improving the appearance model [17, 7], while others have been combining the dynamics and interaction between targets with the target appearance [59, 3, 55, 77, 9, 62, 56].

2.1. Appearance Model

Simple appearance models are widely used in MTT. Many models are based on a raw pixel template representation for simplicity [77, 4, 74, 55, 54], while the color histogram is the most popular representation for appearance modeling in MTT approaches [9, 38, 68, 35]. Other approaches use covariance matrix representation, pixel comparison representation, SIFT-like features, or pose features [25, 83, 29, 22, 50]. Recently, deep neural network architectures have been used for modeling appearance [21, 36, 84]. In these architectures, high-level features are extracted by convolutional neural networks trained for a specific task. The appearance module of our model shares some characteristics with [21], but differs in two crucial ways: first, we handle occlusions and solve the re-identification task by learning a similarity metric between two targets. Second, the network architecture is different and we use a different loss function which we will describe in Section 3.2.

2.2. Motion Model

The target motion model describes how a target moves. The motion cue is crucial for MTT, since knowing the likely position of targets in future frames reduces the search space and hence increases the appearance model accuracy. Popular motion models used in MTT are divided into linear and non-linear motion models. Linear motion models follow a linear movement with constant velocity across frames. This simple motion model is one of the most popular models in MTT [7, 47, 63, 82, 53]. However, there are many cases in which linear motion models cannot deal with long-term occlusions; to remedy this, non-linear motion models are proposed to produce a more accurate prediction [78, 79, 10]. We present a Long Short-Term Memory (LSTM) model which learns to predict similar motion patterns. It is a fully data-driven approach that can handle noisy detections.

2.3. Interaction Model

Most tracking techniques assume that each target has an independent motion model. This simplification can be problematic in crowded scenes. Interaction models capture interactions and forces between different targets in a scene [20, 23, 73]. The two most popular interaction models are the social force models introduced by [20] and the crowd motion pattern model [23]. Social force models are also known as group models. In these models, each target reacts to energy potentials caused by interactions with other objects through forces (repulsion or attraction), while trying to keep a desired speed and motion direction [59, 3, 55, 77, 9, 62, 56]. Crowd motion pattern models are another type of interaction model used in MTT, inspired by the crowd simulation literature [85, 65]. In general, these kinds of models are used for over-crowded scenes [90, 48, 32, 33, 60, 58]. The main drawback of most of these methods is that they are limited to a few hand-designed force terms, such as collision avoidance or group attraction. Recently, Alahi et al. [3] proposed to use Long Short-Term Memory networks to jointly reason across multiple individuals (referred to as social LSTM). They presented an architecture to forecast the long-term trajectories of all targets. We use a similar LSTM-based architecture. However, our data-driven interaction model is trained to solve the re-identification task as opposed to long-term prediction.

Finally, when reasoning with multiple cues, previous works combine them in a hand-crafted fashion without adequately modeling long-term dependencies. None of the previous methods discussed in Sections 2.1, 2.2, and 2.3 combines appearance, motion, and interaction cues in a coherent end-to-end architecture. In this work, we propose a structure of RNNs to cope with such limitations of previous works. We learn a representation that encodes long-term temporal dependencies across multiple cues, i.e., appearance, motion, and interaction, automatically in a data-driven fashion.

3. Multi-Target Tracking Framework

The task of Multi-Target Tracking (MTT) consists of detecting multiple targets at each time frame and matching their identities in different frames, yielding a set of target trajectories over time. We address this problem by using a "tracking-by-detection" paradigm. As the input, the detection results are produced by an object detector. Given a new frame, the tracker computes the similarity scores between the already tracked targets and the newly detected objects (more details in Section 3.1). These similarity scores are calculated using our framework (as described in Figure 2). They are used to connect the detections $d_j$ and targets $t_i$ in a bipartite graph, as shown on the right side of Figure 2. Then, the Hungarian algorithm [49] is used to find the optimal assignments. In this work, we propose a new method to compute these similarity scores.

Figure 2. We use a structure of RNNs (the dashed rectangle) to compute the similarity scores between targets $t_i$ and detections $d_j$. The scores are used to construct a bipartite graph between the targets and detections. The structure of RNNs is comprised of three RNNs, Appearance (A), Motion (M), and Interaction (I), that are combined through another RNN (referred to as the target RNN (O)).

3.1. Overall Architecture

We have identified appearance cues, motion priors, and interactive forces as critical cues of the MTT problem. As discussed in the introduction, combining these cues linearly is not necessarily the best way to compute the similarity score. We instead propose to use a structure of RNNs to combine these cues in a principled way.

In our framework, we represent each cue with an RNN. We refer to the RNNs obtained from these cues as appearance (A), motion (M), and interaction (I) RNNs. The features represented by these RNNs ($\phi_A$, $\phi_M$, $\phi_I$) are combined through another RNN which is referred to as the target (O) RNN. More details on the architecture and training process of these RNNs can be found in Sections 3.2, 3.3, 3.4, and 3.5 respectively. The target RNN outputs a feature vector, $\phi(t,d)$, which is used to output the similarity between a target $t$ and a detection $d$.

By using RNNs, more precisely LSTM networks, we have the capacity to encode long-term dependencies in the sequence of observations. Traditionally, similarity scores in a graph-based tracking framework were computed given only the observation from the previous frame, i.e., a pairwise similarity score [86, 1, 34, 64]. Our proposed similarity score is computed by reasoning on the sequence of observations. In Section 4.2, we demonstrate the power of our representation by reasoning on a sequence of variable length as opposed to a pairwise similarity score. In the rest of this section, we describe each component of our method.

3.2. Appearance

The underlying idea of our appearance model is that we can compute the similarity score between a target and a candidate detection based on purely visual cues. More specifically, we can treat this problem as a specific instance of re-identification, where the goal is to take pairs of bounding boxes and determine if their content corresponds to the same target. Therefore, our appearance model should capture subtle similarities between input pairs, as well as be robust to occlusions and other visual disturbances. The appearance model's output feature vector is produced by an RNN (A), which in turn receives its input from the appearance feature extractor (see Figure 3).

Figure 3. Our appearance model. The inputs are the bounding boxes of target $i$ from time $1$ to $t$, and the detection $j$ at time $t+1$ we wish to compare. The output is a feature vector $\phi_A$ that encodes if the bounding box at time $t+1$ corresponds to a specific target $i$ at time $1, 2, \dots, t$. We use a CNN for our appearance feature extractor.

Architecture: Our appearance RNN (A) is an LSTM that accepts as inputs the appearance features from the appearance feature extractor ($\phi_1^A, \dots, \phi_t^A$) and produces an H-dimensional output $\phi_i$ for each timestep. The appearance features are the last hidden layer features of a Convolutional Neural Network (CNN).

Let $BB_i^1, \dots, BB_i^t$ be the bounding boxes of target $i$ at timesteps $1, \dots, t$ and $BB_j^{t+1}$ be the detection $j$ we wish to compare with target $i$.
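To make the matching step from the start of Section 3 concrete, the sketch below assigns targets to detections so that the summed similarity is maximal. It is a minimal illustration under stated assumptions, not the authors' implementation: the score matrix is made up, `best_assignment` is a hypothetical helper, and an exhaustive search over permutations stands in for the Hungarian algorithm [49], which reaches the same optimum in polynomial time.

```python
from itertools import permutations


def best_assignment(similarity):
    """Return (pairs, total) for the assignment with the highest summed score.

    similarity[i][j] is the score between target i and detection j.
    Assumes there are at least as many detections as targets.
    Exhaustive search, factorial cost: suitable for tiny examples only.
    """
    n_targets = len(similarity)
    n_detections = len(similarity[0])
    best_total, best_cols = float("-inf"), None
    # Each permutation picks one distinct detection per target.
    for cols in permutations(range(n_detections), n_targets):
        total = sum(similarity[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return list(enumerate(best_cols)), best_total


# Hypothetical similarity scores: rows are targets, columns are detections.
scores = [
    [0.23, 0.12, 0.95],
    [0.80, 0.10, 0.30],
    [0.20, 0.70, 0.40],
]
pairs, total = best_assignment(scores)
print(pairs, round(total, 2))
```

On this toy matrix, target 0 is matched to detection 2, target 1 to detection 0, and target 2 to detection 1. With a realistic number of targets, the exhaustive loop would be replaced by a Hungarian solver; the bipartite formulation stays the same.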
The CNN accepts the raw content within each bounding box and passes it through its layers until it finally produces a 500-dimensional feature vector ($\phi_t^A$). We also pass $BB_j^{t+1}$ (which we wish to determine whether it corresponds to the true appearance trajectory of target $i$ or not) through the same CNN that maps it to an H-dimensional vector $\phi_j$. The LSTM's output $\phi_i$ is then concatenated with this vector, and the result $\phi$ is passed to another FC layer which brings the 2H-dimensional vector to a k-dimensional feature vector $\phi_A$ (as illustrated in Figure 3). We pre-train our appearance model using a Softmax classifier for a 0/1 classification problem: whether $BB_j^{t+1}$ corresponds to a true appearance trajectory $BB_i^1, \dots, BB_i^t$. When combining with other cues, we use $\phi_A$ of size 500 as part of the input to our target RNN (O).

Note that we use a 16-layer VGGNet as our CNN in Figure 3. We begin with the pre-trained weights of this network, remove the last FC layer and add an FC layer of size 500 so that the network now outputs a 500-dimensional vector. We then train this CNN for the re-identification task, for which the details can be found in Section 4.3.

3.3. Motion

The second cue of our overall framework is the independent motion property of each target. It can help in tracking targets that are occluded or lost. One key challenge is to handle noisy detections. Even when the real motion of a target is linear, since detections can be noisy, the sequence of coordinates and hence velocities can be non-linear, especially if we reason on the image plane. We train a Long Short-Term Memory (LSTM) network on trajectories of noisy 2D velocities (extracted by our motion feature extractor) to be able to learn these non-linearities from data (see Figure 4).

Figure 4. Our motion model. The inputs are the 2D velocities of the target (on the image plane). The output is a feature vector $\phi_M$ that encodes if velocity $v_j^{t+1}$ corresponds to a true trajectory $v_i^1, v_i^2, \dots, v_i^t$.

Architecture: Let the velocity of target $i$ at the $t$-th timestep be defined as:

$$v_i^t = (vx_i^t, vy_i^t) = (x_i^t - x_i^{t-1},\; y_i^t - y_i^{t-1}),$$

where $(x_i^t, y_i^t)$ are the 2D coordinates of each target on the image plane (center of the bounding boxes).

Our motion RNN (M) is an LSTM that accepts as inputs the velocities of a specific target at timesteps $1, \dots, t$ as motion features, and produces an H-dimensional output $\phi_i$. We also pass the velocity vector of the detection $j$ at timestep $t+1$ (which we wish to determine whether it corresponds to the true trajectory of target $i$ or not) through a fully-connected layer that maps it to an H-dimensional vector $\phi_j$ (this makes $\phi_j$ the same size as $\phi_i$). The LSTM output is then concatenated with this vector, and the result is passed to another fully connected layer which brings the 2H-dimensional vector to a k-dimensional feature vector $\phi_M$ (as illustrated in Figure 4). We pre-train our motion model using a Softmax classifier for a 0/1 classification problem: whether velocity $v_j^{t+1}$ corresponds to a true trajectory $v_i^1, \dots, v_i^t$. When combining with other cues, we use $\phi_M$ of size 500 as part of the input to our target RNN (O).

3.4. Interaction Model

The motion of a particular target is governed not only by its own previous motion, but also by the behavior of nearby targets. We incorporate this cue into our overall framework by formulating an interaction model. Since the number of nearby targets can vary, in order to use the same size input, we model the neighborhood of each target as a fixed size occupancy grid. The occupancy grids are extracted from our interaction feature extractor. For each target, we use an LSTM network to model the sequence of occupancy grids (see Figure 6).

Figure 5. Our interaction model. The inputs are the occupancy maps across time (on the image plane). The output is a feature vector $\phi_I$ that encodes if the occupancy map at time $t+1$ corresponds to a true trajectory of occupancy maps at time $1, 2, \dots, t$.

Figure 6. Illustration of the steps involved in computing the occupancy map. The locations of the bounding box centers of nearby targets are encoded in a grid (the occupancy map) centered around the target. For implementation purposes, the map is represented as a vector.

Architecture: Let $O_i^1, O_i^2, \dots, O_i^t$ represent the 2D occupancy grid for target $i$ at timesteps $1, \dots, t$. The positions of all the neighbors are pooled in this map. The $(m, n)$ element of the map is simply given by:

$$O_i^t(m,n) = \bigvee_{j \in N_i} \mathbf{1}_{mn}[x_j^t - x_i^t,\; y_j^t - y_i^t],$$

where $\vee$ is logical disjunction, $\mathbf{1}_{mn}[x, y]$ is an indicator function to check if the person located at $(x, y)$ is in the $(m, n)$ cell of the grid, and $N_i$ is the set of neighbors corresponding to person $i$. The map is further represented as a vector (see Figure 6). Note that all the 2D locations of targets are their equivalent bounding box centers on the image plane.

Our interaction RNN (I) is an LSTM that accepts as input the occupancy grids centered on a specific target for timesteps $1, \dots, t$ (extracted by the interaction feature extractor) and produces an H-dimensional output $\phi_i$ for each timestep. We also pass the occupancy grid of detection $j$ at timestep $t+1$ (which we wish to determine whether it corresponds to the true trajectory of target $i$ or not) through a fully-connected layer that maps it to an H-dimensional vector $\phi_j$ (this makes $\phi_j$ the same size as $\phi_i$). The LSTM output is then concatenated with this vector, resulting in the vector $\phi$, which is passed to another fully connected layer that brings the 2H-dimensional vector $\phi$ to a k-dimensional feature vector $\phi_I$ (as illustrated in Figure 5). We pre-train our interaction model using a Softmax classifier for a 0/1 classification problem. Similar to the motion model, when combining with other cues, we use $\phi_I$ of size 500 as part of the input to our target RNN (O).

3.5. Target

Our overarching model shown in Figure 2 is constructed by combining the appearance, motion, and interaction RNNs through another RNN which is referred to as the target RNN (O).

The training proceeds in two stages:

(i) First, networks A, M and I (corresponding to appearance, motion, and interaction RNNs) as well as the CNN (appearance feature extractor) are pre-trained separately. We use a standard Softmax classifier and cross-entropy loss. Each RNN outputs the probabilities for the positive and negative classes, where positive indicates that the newly detected object matches the previous trajectory of the target (in terms of appearance, motion, or interaction properties, depending on the RNN in charge), and negative indicates otherwise.

(ii) Second, the target RNN is jointly trained end-to-end with the component RNNs A, M, and I. The output vectors of the A, M, and I networks are concatenated into a single feature vector and serve as input to the target RNN. Our target RNN has the capacity to learn long-term dependencies of all cues across time. The last hidden state of the target RNN (H-dimensional) goes through a fully-connected layer resulting in a feature vector $\phi(t,d)$ that encodes all these long-term dependencies of all cues across time. Our target RNN is also trained to perform the task of data association: it outputs the score of whether a detection ($d$) corresponds to a target ($t$) from $\phi(t,d)$ using a Softmax classifier and cross-entropy loss.

In both training stages, the networks are trained using MOT15 and MOT16 training data, in which positive examples are true pedestrian trajectories (consisting of appearances, velocities, and occupancy maps depending on the RNN), and negative examples are constructed by altering the pedestrian's appearance or location in the final frame of the trajectory, simply by choosing another target's properties for the final frame.

4. Experimental Results

We have presented our multi-cue representation learning framework to compute the similarity score between a sequence of observations and a new detection. We use our learned representation to tackle the Multi-Target Tracking problem. We first present the overall performance of our framework on the MOT challenge [37] and then present more insights and analysis on our representation.

4.1. Multi-Target Tracking

To recall, we use our learned representation in the MDP framework [75]. We have one target LSTM for each target, and the MDP framework tracks the targets using the similarity computed with our learned representation.

Metrics. We report the same metrics as the ones suggested in the MOT2D Benchmark challenge [37]: Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), Mostly Tracked targets (MT), Mostly Lost targets (ML), False Positives (FP), False Negatives (FN), ID Switches (IDS), and finally the number of frames processed in one second (Hz), which indicates the speed of a tracking method.

Implementation Details. In all experiments the values of parameters H, k, and sequence length are 128, 100, and 6 respectively for all RNNs. Moreover, in Section 3.4 the image is sampled uniformly with a 15*15 grid where a 7*7 sub-grid centered around a specific person is used as its occupancy grid. The network hyper-parameters are chosen by cross validation and our framework was trained with the Adam update. The RNNs are trained from scratch, with a mini-batch size of 64 and a learning rate of 0.002, sequentially decreased every 10 epochs by a factor of 10 (for 50 epochs). Note that this is the same for training all RNNs. Moreover, we use our method in the MDP framework [75]. For each target, MDP has two processes. First, it independently tracks the target with a single object tracker based on optical flow. Then, when the target gets occluded, the single object tracker stops tracking and a bipartite graph is constructed similar to Figure 2. The Hungarian algorithm is used to recover occluded targets. Note that MDP also proposes to learn a similarity score given a hand-crafted representation. We replace their representation with the output of our target RNN ($\phi(t,d)$) to demonstrate the strength of our learning method.

MOT Challenge Benchmark. We report the quantitative results of our method on the 2DMOT2015 Benchmark [37] and MOT16 [46] in Tables 5 and 6. These challenges share the training and testing set for 11 and 14 sequences respectively. We used their publicly shared noisy detections. Our method outperforms previous methods on multiple metrics such as MOTA, MT, and ML. Our MOTA even outperforms offline methods (in the 2015 challenge) that have access to the whole set of future detections to reason on the data association step. Using long-term dependencies of multiple cues allows our method to recover the right target after an occlusion or drift; hence we have higher MT and lower ML, but our IDS is higher. Indeed, when targets are occluded, our method can wrongly assign them to other detections. But when the targets re-appear, our method re-matches them with the correct detections. Such a process leads to a high number of switches. Nevertheless, the MT metric remains high.

Table 5. Tracking performance on the test set of the 2DMOT2015 Benchmark with public detections.

Tracker            Mode      MOTA↑   MOTP↑   MT↑      ML↓      FP↓      FN↓      IDS↓    Frag↓   Hz↑
SiameseCNN [36]    Offline   29.0    71.2    8.5%     48.4%    5,160    37,798   639     1,316   52.8
CNNTCM [71]        Offline   29.6    71.8    11.2%    44.0%    7,786    34,733   712     943     1.7
TSMLCDEnew [70]    Offline   34.3    71.7    14.0%    39.4%    7,869    31,908   618     959     6.5
JointMC [27]       Offline   35.6    71.9    23.2%    39.3%    10,580   28,508   457     969     0.6
TC_ODAL [5]        Online    15.1    70.5    3.20%    55.80%   12,970   38,538   637     1,716   1.7
RMOT [81]          Online    18.6    69.6    5.30%    53.30%   12,473   36,835   684     1,282   7.9
SCEA [79]          Online    29.1    71.1    8.9%     47.3%    6,060    36,912   604     1,182   6.8
MDP [75]           Online    30.3    71.3    13.00%   38.40%   9,717    32,422   680     1,500   1.1
TDAM [80]          Online    33.0    72.8    13.3%    39.1%    10,064   30,617   464     1,506   5.9
Ours               Online    37.6    71.7    15.8%    26.8%    7,933    29,397   1,026   2,024   1.0

Table 6. Tracking performance on the test set of the MOT16 Benchmark with public detections.

Tracker            Mode      MOTA↑   MOTP↑   MT↑      ML↓      FP↓      FN↓       IDS↓    Frag↓   Hz↑
LINF1 [14]         Offline   41      74.8    11.60%   51.30%   7,896    99,224    430     963     1.1
MHT_DAM [31]       Offline   42.9    76.6    13.60%   46.90%   5,668    97,919    499     659     0.8
JMC [78]           Offline   46.3    75.7    15.50%   39.70%   6,373    90,914    657     1,114   0.8
NOMT [8]           Offline   46.4    76.6    18.30%   41.40%   9,753    87,565    359     504     2.6
OVBT [6]           Online    38.4    75.4    7.50%    47.30%   11,517   99,463    1,321   2,140   0.3
EAMTT_pub [61]     Online    38.8    75.1    7.90%    49.10%   8,114    102,452   965     1,657   11.8
oICF [30]          Online    43.2    74.3    11.30%   48.50%   6,651    96,515    381     1,404   0.4
Ours               Online    47.2    75.8    14.0%    41.6%    2,681    92,856    774     1,675   1

The impact of our learned representation becomes evident compared with the previously published MDP method. By only switching the representation and keeping the same data association method proposed in [75], we obtain a 20% relative boost in MOTA. The benefits of our representation are further emphasized with the Stanford dataset [59].

Stanford Drone Dataset. As we have mentioned before, one of the main advantages of our model compared to other multi-target tracking methods is the similarity score, which is a function of multiple cues across time and seeks to use the right cues at each time. Often, some cues should vote for the similarity score since the others are not discriminant enough or very noisy. To test the power of our method, we also conduct multi-target tracking experiments on videos that are very different from the MOT challenge [37], i.e., the Stanford Drone Dataset [59]. All targets are small and hence appearance models might be faulty (as illustrated in Figure 9). In Table 1, we compare our method with previously reported MDP-based methods. Our method outperforms all the MDP-based methods on all metrics. Even without fine-tuning our representation on the drone dataset, our method outperforms previous works. After fine-tuning, we obtain the best performance as expected. It shows the power of a data-driven method to learn a representation over any input signal.

Table 1. MOT tracking results on the Stanford Drone Dataset. Our (MOT) version has been trained only on the MOT challenge training data and has not been fine-tuned on the Stanford drone dataset, whereas our (MOT+Drone) version has also been fine-tuned on the drone dataset.

Method             MOTA   MOTP   Recall   Precision   MT      ML
MDP [75]+Lin       51.5   74.2   74.1     80.1        44.2%   20.9%
MDP+SF [77]        73.5   77.1   84.4     91.5        58.1%   25.5%
MDP+SF-mc [59]     75.6   78.2   86.1     92.6        60%     23.2%
Ours (MOT)         78.6   79.4   88.2     93.9        69.7%   19.5%
Ours (MOT+Drone)   82.9   80.3   92.3     95.3        85%     15.2%

Figure 9. Qualitative results on the Stanford Drone dataset [59]. The first row presents the tracking results of our method whereas the second row presents the results of MDP+SF-mc [59]. The dashed circles illustrate ID switches in the previous method.

In the remainder of this section, we analyze the performance of our representation with an ablation study and provide more insights on our appearance model on more specific tasks.

4.2. Ablation Study

The underlying motivation of our proposed framework is to address the following two challenges (as listed in the introduction): effectively modeling the history of each cue, and effectively combining multiple cues. We now present experiments towards these two goals on the validation set of the 2DMOT2015 challenge [37]. We use the same evaluation protocol (training and test splits) as in [75] for our validation set.

Impact of the History. One of the advantages of our representation compared to the previous ones is the capacity to learn long-term dependencies of cues across time, i.e., retaining information from the past. We investigate the impact of changing the sequence length of the LSTMs on tracking accuracy, where the sequence length of an LSTM is the number of unrolled timesteps used while training the LSTM. Figure 7 (b) shows the MOTA score of different components for the validation set, under different LSTM sequence lengths for our target LSTM. We can see that increasing the LSTM sequence length positively impacts the MOTA. The performance saturates after 3 frames on the Stanford drone dataset and after 6 frames on the MOTChallenge dataset. These results confirm our claim that RNNs can effectively model the history of a cue. Moreover, the difference between the MOT and Stanford datasets can be explained by the difference in the datasets. The drone dataset does not have any long-term occlusion whereas MOT has full long-term occlusions. Our framework learns to encode long-term temporal dependencies across multiple cues, which helps in recovering from long-term occlusions. We claim that if most occlusions are less than n frames long, we need to keep dependencies over at least the past n frames to be able to recover the object from an occlusion. Figure 7 (a) shows that most occlusions (more than 80 percent) last for less than 6 frames in the MOT dataset, which supports why the MOTA saturates after a sequence length of 6 (see Figure 7 (b)). Since the Drone dataset does not have any long-term occlusions, our model does not need long-term dependencies there. Nevertheless, we can see that modeling the sequence of observations on both datasets positively impacts the similarity score and hence the tracking performance.

Figure 7. (a) Occlusion length distribution in the MOT and Stanford Drone dataset validation sets (count versus occlusion length in frames). (b) Analysis of the sequence length (memory) used for our model on the validation set for both datasets (MOTA versus sequence length in frames). We report the MOTA scores.

In order to further support our use of RNNs for modeling the temporal dependencies in the MTT framework, we conduct experiments using fully-connected layers instead of RNNs. We provide results of two experiments, one replacing only the target LSTM with an FC, and a second experiment in which we replaced all LSTM networks with FCs. Table 2 shows the results of this experiment.

Table 2. Analysis of our model on the MOT validation set compared to FC baselines. First, using an FC only instead of the target RNN (O), or top RNN (Exp1), and second, an FC layer for all RNNs (Exp2).

Tracker   MOTA   MOTP   MT    ML     FP      FN       IDS
Ours      30.8   73.8   14    51.7   2,563   13,127   98
Exp1      18.2   71.2   7.1   72.1   3,851   15,893   350
Exp2      12.9   71.0   4.3   75.9   4,259   16,751   396

Impact of Multiple Cues. We investigate the contribution of different cues in our framework by measuring the performance in terms of MOTA on the validation set. Figure 8 presents the results of our ablation study. The appearance cue is the most important one. Each cue helps to increase the performance. It is worth pointing out that our proposed interaction cue positively impacts the overall performance. Our proposed target LSTM (in charge of combining all the other RNNs) effectively reasons on all the cues to increase the performance. Table 3 reports more details on the impact of each cue on the various tracking metrics.

Figure 8. Analysis of our model on the MOT validation set using different sets of components: (A) Appearance, (M) Motion, and (I) Interaction. We report the MOTA scores (one bar per tracker: A+M+I, A+M, A+I, M+I, A, M, I).

Table 3. Analysis of our model on the MOT validation set using different sets of components: (A) Appearance, (M) Motion, and (I) Interaction. We report the standard MOT metrics.

Tracker   MOTA   MOTP   MT     ML     FP      FN       IDS
A+M+I     30.8   73.8   14     51.7   2,563   13,127   98
A+M       28.8   73.9   13.5   52.1   2,776   13,361   134
A+I       27.4   73.9   12     53.4   2,679   13,991   136
M+I       22     73.8   9.8    52.1   2,714   14,954   298
A         23.7   73.7   11.5   55.6   3,359   14,001   138
M         19.2   73.7   8.5    68.4   3,312   15,023   313
I         15.4   73.5   5.6    69.9   3,061   16,250   354

4.3. Re-identification Task

For completeness, we report the performance of our appearance cue on the re-identification task. We construct a Siamese CNN using the same pre-trained CNN used as our appearance feature extractor in Section 3.2. We train our Siamese CNN on positive and negative samples extracted from the MOT2D and CUHK03 datasets [37, 40]. We extracted more than 500k positive and negative samples from 2DMOT2015 and CUHK03. In the case of MOT2D, we use instances of the same target that occur in different frames for positive pairs, and we use instances of different targets across all frames for negative pairs. Network hyper-parameters are chosen by cross validation. We use a mini-batch size of 64 and a learning rate of 0.001, sequentially decreased every 2 epochs by a factor of 10 (for 20 epochs). We evaluate our appearance model on the CUHK03 re-identification benchmark [89]. Table 4 presents our results for Rank 1, Rank 5, and Rank 10 accuracies. Our method achieves 55.9 percent accuracy for Rank 1, which is competitive with the state-of-the-art method (57.3%). When measuring the re-identification rate for Rank 10, our appearance model outperforms previous methods. This is a crucial indicator that our model can extract a meaningful feature representation for the re-identification task.

Table 4. Performance comparison of our appearance feature extractor with state-of-the-art algorithms on the CUHK03 dataset.

Method         Rank 1   Rank 5   Rank 10
FPNN [40]      19.9     49.3     64.7
BoW [91]       23       45       55.7
ConvNet [2]    45       75.3     95
LX [41]        46.3     78.9     88.6
MLAPG [42]     51.2     83.6     92.1
SS-SVM [88]    51.2     80.8     89.6
SI-CI [72]     52.2     84.3     92.3
DNS [87]       54.7     84.8     94.8
SLSTM [69]     57.3     80.1     88.3
Ours           55.9     81.7     95.1

5. Conclusions

We have presented a method that encodes dependencies across multiple cues over a temporal window. Our learned multi-cue representation is used to compute the similarity scores in a tracking framework. We showed that by switching the existing state-of-the-art representation with our proposed one, the tracking performance (measured as MOTA) increases by 20%. Consequently, our method ranks first in existing benchmarks. As future work, we plan to use our data-driven method to track social animals such as ants. Their appearance and dynamics are quite different from those of humans. It will be exciting to learn a representation for such collective behavior and help researchers in biology to gain more insights in their field.

References

[1] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pages 798–805. IEEE, 2006.
[2] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.
[3] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[4] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. In Computer Vision–ECCV 2008, pages 1–14. Springer, 2008.
[5] S.-H. Bae and K.-J. Yoon.
[...] European Conference on Computer Vision, pages 774–790. Springer, 2016.
[15] E. Fontaine, A. H. Barr, and J. W. Burdick. Model-based tracking of multiple worms and fish. In ICCV Workshop on Dynamical Vision. Citeseer, 2007.
[16] Freepik. Black silhouettes of mans and woman walking. Designed by Freepik.com.
[17] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In BMVC, volume 1, page 6, 2006.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] D. Helbing and P. Molnar. Social force model for pedestrian dynamics. Physical Review E, 51(5):4282, 1995.
Robustonlinemulti-objecttrack- ing based on tracklet confidence and online discriminative [21] D.Held,S.Thrun,andS.Savarese. Learningtotrackat100 appearancelearning.In2014IEEEConferenceonComputer fpswithdeepregressionnetworks. InEuropeanConference Vision and Pattern Recognition, pages 1218–1225. IEEE, onComputerVision,pages749–765.Springer,2016. 2 2014. 8 [22] S. Hong and B. Han. Visual tracking by sampling tree- [6] Y.Ban,S.Ba,X.Alameda-Pineda,andR.Horaud. Track- structured graphical models. In European Conference on ingmultiplepersonsbasedonavariationalbayesianmodel. ComputerVision,pages1–16.Springer,2014. 2 InEuropeanConferenceonComputerVision,pages52–67. [23] M.Hu, S.Ali, andM.Shah. Detectingglobalmotionpat- Springer,2016. 8 ternsincomplexvideos.InPatternRecognition,2008.ICPR [7] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, 2008. 19th International Conference on, pages 1–5. IEEE, andL.VanGool. Robusttracking-by-detectionusingade- 2008. 2 tector confidence particle filter. In Computer Vision, 2009 [24] C.Huang,B.Wu,andR.Nevatia. Robustobjecttrackingby IEEE 12th International Conference on, pages 1515–1522. hierarchicalassociationofdetectionresponses.InComputer IEEE,2009. 2 Vision–ECCV2008,pages788–801.Springer,2008. 2 [8] W.Choi. Near-onlinemulti-targettrackingwithaggregated [25] H. Izadinia, V. Ramakrishna, K. M. Kitani, and D. Huber. local flow descriptor. In Proceedings of the IEEE Interna- Multi-pose multi-target tracking for activity understanding. tional Conference on Computer Vision, pages 3029–3037, In Applications of Computer Vision (WACV), 2013 IEEE 2015. 8 Workshopon,pages385–390.IEEE,2013. 2 [9] W.ChoiandS.Savarese. Multipletargettrackinginworld [26] A.Jain,A.R.Zamir,S.Savarese,andA.Saxena.Structural- coordinate with single, minimally calibrated camera. In rnn: Deeplearningonspatio-temporalgraphs. InProceed- Computer Vision–ECCV 2010, pages 553–567. Springer, ingsoftheIEEEConferenceonComputerVisionandPattern 2010. 
2 Recognition,pages5308–5317,2016. 2 [10] C.Dicle,O.I.Camps,andM.Sznaier. Thewaytheymove: [27] M. Keuper, S. Tang, Y. Zhongjie, B. Andres, T. Brox, Tracking multiple targets with similar appearance. In Pro- and B. Schiele. A multi-cut formulation for joint seg- ceedingsoftheIEEEInternationalConferenceonComputer mentation and tracking of multiple objects. arXiv preprint Vision,pages2304–2311,2013. 2 arXiv:1607.06317,2016. 8 [11] A.Elfes.Usingoccupancygridsformobilerobotperception [28] Z.Khan,T.Balch,andF.Dellaert. Anmcmc-basedparticle andnavigation. Computer,22(6):46–57,1989. 1 filterfortrackingmultipleinteractingtargets. InComputer [12] A.Ess,B.Leibe,K.Schindler,andL.VanGool. Amobile Vision-ECCV2004,pages279–290.Springer,2004. 1 visionsystemforrobustmulti-persontracking. InComputer [29] B. Y. S. Khanloo, F. Stefanus, M. Ranjbar, Z.-N. Li, Vision and Pattern Recognition, 2008. CVPR 2008. IEEE N.Saunier, T.Sayed, andG.Mori. Alargemarginframe- Conferenceon,pages1–8.IEEE,2008. 1 work for single camera offline tracking with hybrid cues. [13] A. Ess, K. Schindler, B. Leibe, and L. Van Gool. Im- Computer Vision and Image Understanding, 116(6):676– provedmulti-persontrackingwithactiveocclusionhandling. 689,2012. 2 InICRAWorkshoponPeopleDetectionandTracking, vol- [30] H. Kieritz, S. Becker, W. Hu¨bner, and M. Arens. On- ume2.Citeseer,2009. 1 line multi-person tracking using integral channel features. [14] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle. In Advanced Video and Signal Based Surveillance (AVSS), Improving multi-frame data association with sparse repre- 2016 13th IEEE International Conference on, pages 122– sentations for robust near-online multi-object tracking. In 130.IEEE,2016. 8 [31] C.Kim,F.Li,A.Ciptadi,andJ.M.Rehg. Multiplehypoth- [45] E.Meijering,O.Dzyubachyk,I.Smal,andW.A.vanCap- esis tracking revisited. 
In Proceedings of the IEEE Inter- pellen.Trackingincellanddevelopmentalbiology.InSemi- nationalConferenceonComputerVision,pages4696–4704, narsincell&developmentalbiology,volume20,pages894– 2015. 8 902.Elsevier,2009. 1 [32] L.KratzandK.Nishino.Trackingwithlocalspatio-temporal [46] A.Milan,L.Leal-Taixe,I.Reid,S.Roth,andK.Schindler. motionpatternsinextremelycrowdedscenes. InComputer Mot16: A benchmark for multi-object tracking. arXiv VisionandPatternRecognition(CVPR),2010IEEEConfer- preprintarXiv:1603.00831,2016. 2,6 enceon,pages693–700.IEEE,2010. 2 [47] A.Milan,S.Roth,andK.Schindler.Continuousenergymin- [33] L. Kratz and K. Nishino. Tracking pedestrians using lo- imizationformultitargettracking.IEEEtransactionsonpat- cal spatio-temporal motion patterns in extremely crowded ternanalysisandmachineintelligence,36(1):58–72,2014.2 scenes. IEEEtransactionsonpatternanalysisandmachine [48] P. Mordohai and G. Medioni. Dimensionality estima- intelligence,34(5):987–1002,2012. 2 tion, manifold learning and function approximation using [34] C.-H.KuoandR.Nevatia. Howdoespersonidentityrecog- tensor voting. Journal of Machine Learning Research, nition help multi-person tracking? In Computer Vision 11(Jan):411–450,2010. 2 andPatternRecognition(CVPR),2011IEEEConferenceon, [49] J.Munkres. Algorithmsfortheassignmentandtransporta- pages1217–1224.IEEE,2011. 1,3 tionproblems. Journalofthesocietyforindustrialandap- pliedmathematics,5(1):32–38,1957. 3 [35] N.Le,A.Heili,andJ.-M.Odobez.Long-termtime-sensitive [50] H.NamandB.Han. Learningmulti-domainconvolutional costsforcrf-basedtrackingbydetection. InEuropeanCon- neural networks for visual tracking. In Proceedings of the ference on Computer Vision, pages 43–51. Springer, 2016. IEEEConferenceonComputerVisionandPatternRecogni- 2 tion,pages4293–4302,2016. 2 [36] L.Leal-Taixe´,C.Canton-Ferrer,andK.Schindler. 
Learning [51] P.Nillius,J.Sullivan,andS.Carlsson.Multi-targettracking- bytracking:Siamesecnnforrobusttargetassociation.arXiv linkingidentitiesusingbayesiannetworkinference.InCom- preprintarXiv:1604.07866,2016. 2,8 puterVisionandPatternRecognition,2006IEEEComputer [37] L.Leal-Taixe´,A.Milan,I.Reid,S.Roth,andK.Schindler. SocietyConferenceon,volume2,pages2187–2194.IEEE, MOTChallenge 2015: Towards a benchmark for multi- 2006. 1 target tracking. arXiv:1504.01942 [cs], Apr. 2015. arXiv: [52] K.Okuma,A.Taleghani,N.DeFreitas,J.J.Little,andD.G. 1504.01942. 2,5,6,7 Lowe. A boosted particle filter: Multitarget detection and [38] B.Leibe,K.Schindler,N.Cornelis,andL.VanGool. Cou- tracking. In Computer Vision-ECCV 2004, pages 28–39. pled object detection and tracking from static cameras and Springer,2004. 2 movingvehicles.PatternAnalysisandMachineIntelligence, [53] S. Oron, A. Bar-Hille, and S. Avidan. Extended lucas- IEEETransactionson,30(10):1683–1698,2008. 2 kanadetracking. InEuropeanConferenceonComputerVi- [39] K.Li, E.D.Miller, M.Chen, T.Kanade, L.E.Weiss, and sion,pages142–156.Springer,2014. 2 P. G. Campbell. Cell population tracking and lineage con- [54] S.Oron, A.Bar-Hillel,andS.Avidan. Real-timetracking- structionwithspatiotemporalcontext. Medicalimageanal- with-detectionforcopingwithviewpointchange. Machine ysis,12(5):546–566,2008. 1 VisionandApplications,26(4):507–518,2015. 2 [40] W.Li,R.Zhao,T.Xiao,andX.Wang.Deepreid:Deepfilter [55] S.Pellegrini,A.Ess,K.Schindler,andL.VanGool. You’ll pairingneuralnetworkforpersonre-identification. InPro- neverwalkalone: Modelingsocialbehaviorformulti-target ceedings of the IEEE Conference on Computer Vision and tracking. InComputerVision,2009IEEE12thInternational PatternRecognition,pages152–159,2014. 7 Conferenceon,pages261–268.IEEE,2009. 1,2 [41] S.Liao,Y.Hu,X.Zhu,andS.Z.Li. Personre-identification [56] S.Pellegrini,A.Ess,andL.VanGool. 
Improvingdataas- by local maximal occurrence representation and metric sociation by joint modeling of pedestrian trajectories and learning. InProceedingsoftheIEEEConferenceonCom- groupings. In European Conference on Computer Vision, puter Vision and Pattern Recognition, pages 2197–2206, pages452–465.Springer,2010. 1,2 2015. 7 [57] A. Petrovskaya and S. Thrun. Model based vehicle detec- [42] S.LiaoandS.Z.Li. Efficientpsdconstrainedasymmetric tionandtrackingforautonomousurbandriving.Autonomous metriclearningforpersonre-identification. InProceedings Robots,26(2-3):123–139,2009. 1 of the IEEE International Conference on Computer Vision, [58] E.RistaniandC.Tomasi. Trackingmultiplepeopleonline pages3685–3693,2015. 7 andinrealtime. InAsianConferenceonComputerVision, [43] C.-W. Lu, C.-Y. Lin, C.-Y. Hsu, M.-F. Weng, L.-W. Kang, pages444–459.Springer,2014. 2 and H.-Y. M. Liao. Identification and tracking of players [59] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese. in sport videos. In Proceedings of the Fifth International Learning social etiquette: Human trajectory understanding ConferenceonInternetMultimediaComputingandService, in crowded scenes. In European Conference on Computer pages113–116.ACM,2013. 1 Vision,pages549–565.Springer,2016. 2,5,6,8 [44] W.Luo,T.-K.Kim,B.Stenger,X.Zhao,andR.Cipolla. Bi- [60] M.Rodriguez,S.Ali,andT.Kanade. Trackinginunstruc- label propagation for generic multiple object tracking. In turedcrowdedscenes.In2009IEEE12thInternationalCon- Proceedings of the IEEE Conference on Computer Vision ferenceonComputerVision,pages1389–1396.IEEE,2009. andPatternRecognition,pages1290–1297,2014. 1 2
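Supplementary sketch (learning-rate schedule). Section 4.3 states a learning rate of 0.001 that is decreased by a factor of 10 every 2 epochs over 20 epochs. The snippet below is one possible reading of that schedule, not the authors' training code: the function name `step_decay_lr`, the 0-indexed epoch convention, and the choice to apply the first drop at epoch 2 are all assumptions made for illustration.

```python
def step_decay_lr(epoch, base_lr=1e-3, drop_factor=10.0, epochs_per_drop=2):
    """Return the learning rate for a 0-indexed epoch under step decay.

    Assumption: the rate stays at base_lr for the first `epochs_per_drop`
    epochs and is divided by `drop_factor` after each further block.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    num_drops = epoch // epochs_per_drop
    return base_lr / (drop_factor ** num_drops)


# Learning rates for the 20 training epochs mentioned in Section 4.3.
schedule = [step_decay_lr(epoch) for epoch in range(20)]

if __name__ == "__main__":
    for epoch, lr in enumerate(schedule):
        print(f"epoch {epoch:2d}: lr = {lr:.3e}")
```

Under this reading the rate becomes very small well before epoch 20, so the later epochs contribute little; whether the authors clipped the rate at some floor is not stated in the text.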
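Supplementary sketch (Rank-k accuracy). Table 4 reports Rank 1, Rank 5, and Rank 10 accuracies: the fraction of query images whose correct identity appears among the k nearest gallery entries. The following is a minimal illustration of that idea under simplifying assumptions. It does not reproduce the CUHK03 evaluation protocol, and the function name `rank_k_accuracy` and the toy distance matrix are hypothetical.

```python
def rank_k_accuracy(dist, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Fraction of queries whose identity is among the k closest gallery items.

    dist[q][g] is the distance between query q and gallery item g
    (smaller means more similar). Ties are broken by gallery index.
    """
    hits = {k: 0 for k in ks}
    for q, row in enumerate(dist):
        # Gallery indices sorted from nearest to farthest for this query.
        order = sorted(range(len(row)), key=row.__getitem__)
        ranked_ids = [gallery_ids[g] for g in order]
        for k in ks:
            if query_ids[q] in ranked_ids[:k]:
                hits[k] += 1
    num_queries = len(dist)
    return {k: hits[k] / num_queries for k in ks}


# Toy example (made-up distances): 3 queries against a gallery of 3 identities.
toy_dist = [
    [0.1, 0.9, 0.5],  # query 0: correct match is nearest
    [0.8, 0.2, 0.3],  # query 1: correct match is nearest
    [0.4, 0.6, 0.7],  # query 2: correct match is farthest
]
toy_query_ids = [0, 1, 2]
toy_gallery_ids = [0, 1, 2]
toy_accuracy = rank_k_accuracy(toy_dist, toy_query_ids, toy_gallery_ids, ks=(1, 2, 3))

if __name__ == "__main__":
    print(toy_accuracy)
```

In a Siamese setup such as the one in Section 4.3, the distances would come from comparing the learned appearance features of a query and a gallery image; how the authors computed them is not specified here.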
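Supplementary sketch (MOTA). Tables 2, 5, and 6 report MOTA alongside FP, FN, and IDS. Under the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDS) / GT, where GT is the total number of ground-truth boxes. The snippet below applies that definition to the "Ours" row of Table 5. The ground-truth count of 61,440 is an assumption: it does not appear in the text above and should be checked against the benchmark before relying on the result.

```python
def mota(false_positives, false_negatives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy under the CLEAR MOT definition."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_positives + false_negatives + id_switches
    return 1.0 - errors / num_ground_truth


# FP, FN, and IDS are taken from the "Ours" row of Table 5.
# ASSUMPTION: total ground-truth boxes in the test set (not stated in the text).
ASSUMED_GT_BOXES = 61_440

ours_mota = mota(
    false_positives=7_933,
    false_negatives=29_397,
    id_switches=1_026,
    num_ground_truth=ASSUMED_GT_BOXES,
)

if __name__ == "__main__":
    print(f"MOTA = {100 * ours_mota:.1f}")
```

Because the three error types are summed with equal weight, a tracker with many ID switches can still post a high MOTA when FN dominates, which is worth keeping in mind when reading the IDS and Frag columns of Table 5.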
