ROBUST AND REAL-TIME DEEP TRACKING VIA MULTI-SCALE DOMAIN ADAPTATION

Xinyu Wang1, Hanxi Li1*, Yi Li2, Fumin Shen3, Fatih Porikli4

1 Jiangxi Normal University, China
2 Toyota Research Institute of North America, USA
3 University of Electronic Science and Technology of China, China
4 Australian National University, Australia
ABSTRACT

Visual tracking is a fundamental problem in computer vision. Recently, some deep-learning-based tracking algorithms have been achieving record-breaking performances. However, due to the high complexity of deep learning, most deep trackers suffer from low tracking speed and are thus impractical in many real-world applications. Some new deep trackers with smaller network structures achieve high efficiency, but at the cost of a significant decrease in precision. In this paper, we propose to transfer the features for image classification to the visual tracking domain via convolutional channel reductions. The channel reduction can simply be viewed as an additional convolutional layer with a specific task. It not only extracts the useful information for object tracking but also significantly increases the tracking speed. To better accommodate the useful features of the target at different scales, the adaptation filters are designed with different sizes. The yielded visual tracker is real-time and also illustrates state-of-the-art accuracies in experiments involving two well-adopted benchmarks with more than 100 test videos.

Fig. 1. The high-level concept of the proposed MSDAT tracker. Left: most of the deep neural network is pretrained for image classification, where the learning algorithm focuses on object classes. Right: an adaptation is performed to transfer the classification features to the visual tracking domain, where the learning algorithm treats each individual object independently.

Index Terms — visual tracking, deep learning, real-time

1. INTRODUCTION
Visual tracking is one of the long-standing computer vision tasks. During the last decade, with the surge of deep learning, more and more tracking algorithms have benefited from deep neural networks, e.g. Convolutional Neural Networks [1, 2] and Recurrent Neural Networks [3, 4]. Despite this well-admitted success, a dilemma still exists in the community: deep learning increases the tracking accuracy, but at the cost of high computational complexity. As a result, most well-performing deep trackers suffer from low efficiency [5, 6]. Recently, some real-time deep trackers were proposed [7, 8]. They achieve very fast tracking speeds, but cannot beat the shallow methods in some important evaluations, as we illustrate later.

In this paper, a simple yet effective domain adaptation algorithm is proposed. The facilitated tracking algorithm, termed Multi-Scale Domain Adaptation Tracker (MSDAT), transfers the features from the classification domain to the tracking domain, where the individual objects, rather than the image categories, serve as the learning subjects. In addition, the adaptation can also be viewed as a dimension-reduction process that removes information redundant for tracking and, more importantly, reduces the channel number significantly. This leads to a considerable improvement in tracking speed. Figure 1 illustrates the adaptation procedure. To accommodate the various features of the target object at different scales, we train filters of different sizes in the domain adaptation layer, as proposed in the Inception network [9]. Our experiments show that the proposed MSDAT algorithm runs at around 35 FPS while achieving tracking accuracy very close to the state-of-the-art trackers. To our best knowledge, MSDAT is the best-performing real-time visual tracker.
2. RELATED WORK

Similar to other fields of computer vision, in recent years more and more state-of-the-art visual trackers have been built on deep learning. [1] is a well-known pioneering work that learns deep features for visual tracking. The DeepTrack method [10, 2] learns a deep model from scratch, updates it online, and achieves higher accuracy. [11, 12] adopt similar learning strategies, i.e., learning the deep model offline with a large number of images while updating it online for the current video sequence. [13] achieves real-time speed by replacing the slow model update with a fast inference process.

The HCF tracker [5] extracts hierarchical convolutional features from the VGG-19 network [14], then feeds the features into correlation filters to regress the response map. It can be considered a combination of deep learning and the fast shallow trackers based on correlation filters. It achieves high tracking accuracy, while its speed is around 10 fps. Hyeonseob Nam et al. proposed to pre-train deep CNNs in multiple domains, with each domain corresponding to one training video sequence [6]. The authors claim that there exist some common properties that are desirable for target representations in all domains, such as robustness to illumination changes. To extract these common features, the authors separate domain-independent information from domain-specific layers. The yielded tracker, termed MD-net, achieves excellent tracking performance, while its tracking speed is only 1 fps.

Recently, some real-time deep trackers have also been proposed. In [7], David Held et al. learn a deep regressor that predicts the location of the current object based on its appearance in the last frame. The tracker obtains a much faster tracking speed (over 100 fps) compared to conventional deep trackers. Similarly, in [8] a fully-convolutional siamese network is learned to match the object template in the current frame. It also achieves real-time speed. Even though these real-time deep trackers illustrate high tracking accuracy, there is still a clear performance gap between them and the state-of-the-art deep trackers.

3. THE PROPOSED METHOD

In this section, we introduce the details of the proposed tracking algorithm, i.e., the Multi-Scale Domain Adaptation Tracker (MSDAT).

3.1. Network structure

In HCF [5], deep features are first extracted from multiple layers of the VGG-19 network [14], and a set of KCF [15] trackers are run on those features, respectively. The final tracking prediction is obtained in a weighted-voting manner. Following the setting in [5], we also extract the deep features from the conv3_5, conv4_5 and conv5_5 layers of the VGG-19 model. However, the VGG-19 network is pre-trained using the ILSVRC dataset [16] for image classification, where the learning algorithm usually focuses on the object categories. This is different from visual tracking tasks, where the individual object is distinguished from other objects (even those of the same category) and the background. Intuitively, it is better to transfer the classification features into the visual tracking domain.

Fig. 2. The network structure of the proposed MSDAT tracker. Three layers, namely conv3_5, conv4_5 and conv5_5, are selected as the feature source. The domain adaptation (shown in yellow lines) reduces the channel number by 8 times and keeps the feature map size unchanged. Better viewed in color.

In this work, we propose to perform the domain adaptation in a simple way. A "tracking branch" is "grafted" onto each feature layer, as shown in Fig. 2. The tracking branch is actually a convolution layer which reduces the channel number by 8 times and keeps the feature map size unchanged. The convolution layer is then learned by minimizing a loss function tailored for tracking, as introduced below.

3.2. Learning strategy

The parameters of the aforementioned tracking branch are learned in a similar manner to the Single Shot MultiBox Detector (SSD), a state-of-the-art detection algorithm [17]. When training, the original layers of VGG-19 (i.e., those before convx_5) are fixed and each "tracking branch" is trained independently. The flowchart of the learning procedure for one tracking branch (based on conv3_4) is illustrated in the upper row of Figure 3, compared with the learning strategy of MD-net [6] (the bottom row). To obtain a complete training circle, the adapted feature in conv3_5 is used to regress the objects' locations and their objectness scores (shown in the dashed block). Please note that the deep learning stage in this work is purely offline and the additional part in the dashed block is abandoned before tracking.

In SSD, a number of "default boxes" are generated for regressing the object rectangles. Furthermore, to accommodate objects of different scales and shapes, the default boxes also vary in size and aspect ratio.
Let m_{i,j} ∈ {0, 1} be an indicator for matching the i-th default box to the j-th ground-truth box. The loss function of SSD writes:

    L(m, c, l, g) = (1/N) (L_conf(m, c) + α L_loc(m, l, g))    (1)

where c is the category of the default box, l is the predicted bounding box, and g is the ground truth of the object box, if applicable. For the j-th default box and the i-th ground truth, the location loss L_loc^{i,j} is calculated as

    L_loc^{i,j}(l, g) = Σ_{u ∈ {x,y,w,h}} m_{i,j} · smooth_L1(l_i^u − ĝ_j^u)    (2)

where ĝ^u, u ∈ {x, y, w, h} is one of the geometry parameters of the normalized ground-truth box.

However, the task of visual tracking differs from detection significantly. We thus tailor the loss function for the KCF algorithm, where both the object size and the KCF window size are fixed. Recalling that the KCF window plays a similar role to the default boxes in SSD [17], we then only need to generate one type of default box, and the location loss L_loc^{i,j}(l, g) is simplified as

    L_loc^{i,j}(l, g) = Σ_{u ∈ {x,y}} m_{i,j} · smooth_L1(l_i^u − g_j^u)    (3)

In other words, only the displacement {x, y} is taken into consideration and there is no need for ground-truth box normalization.

Note that the concept of domain adaptation in this work is different from that defined in MD-net [6], where different video sequences are treated as different domains and thus multiple fully-connected layers are learned to handle them (see Figure 3). This is mainly because MD-net samples the training instances in a sliding-window manner: an object labeled negative in one domain could be selected as a positive sample in another domain. Given that the training video number is C and the dimension of the last convolution layer is d_c, MD-net learns C independent d_c × 2 fully-connected layers alternately, using C soft-max losses, i.e.,

    M_fc^i : R^{d_c} → R^2,  ∀i = 1, 2, ..., C    (4)

where M_fc^i, ∀i ∈ {1, 2, ..., C} denotes the C fully-connected layers that transfer the common visual domain to the individual object domains, as shown in Figure 3.

Fig. 3. The flowcharts of the training processes of MSDAT and MD-net. Note that the network parts inside the dashed blocks are only used for training and are abandoned before tracking. Better viewed in color.

Differing from MD-net, the domain in this work refers to a general visual tracking domain or, more specifically, the KCF domain. It is designed to mimic the KCF input in visual tracking (see Figure 3). In this domain, different tracking targets are treated as one category, i.e., objects. When training, the object's location and confidence (with respect to the objectness) are regressed to minimize the smoothed l_1 loss. Mathematically, we learn a single mapping function M_conv(·) as

    M_msdat : R^{d_c} → R^4    (5)

where the R^4 space is composed of one R^2 space for the displacement {x, y} and one label space R^2.

Compared with Equation 4, the training complexity of Equation 5 decreases and the corresponding convergence becomes more stable. Our experiments prove the validity of the proposed domain adaptation.

3.3. Multi-scale domain adaptation

As introduced above, the domain adaptation in our MSDAT method is essentially a convolution layer. To design the layer, an immediate question is how to select a proper size for the filters. According to Figure 2, the feature maps from different layers vary in size significantly. It is hard to find an optimal filter size for all the feature layers. Inspired by the success of the Inception network [9], we propose to simultaneously learn the adaptation filters at different scales. The response maps produced by the different filter sizes are then concatenated accordingly, as shown in Figure 4. In this way, the input of the KCF tracker involves the deep features from different scales.

In practice, we use 3×3 and 5×5 filters for all three feature layers. Given that the original channel number is K, each type of filter generates K/16 channels and thus the channel reduction ratio is still 8:1.

3.4. Making the tracker real-time

3.4.1. Channel reduction

One important advantage of the proposed domain adaptation is the improvement in tracking speed. It is easy to see that the speed of the KCF tracker drops dramatically as the channel number increases. In this work, after the adaptation, the channel number is shrunk by 8 times, which accelerates the tracker by 2 to 2.5 times.
Fig. 4. Learning the adaptation layer using three different types of filters.

3.4.2. Lazy feed-forward

Another effective way to increase the tracking speed is to reduce the number of feed-forward passes of the VGG-19 network. In HCF, the feed-forward process is conducted twice at each frame, once for prediction and once for the model update [5]. However, we notice that the displacement of the moving object is usually small between two frames. Consequently, if we make the input window slightly larger than the KCF window, one can reuse the feature maps in the updating stage as long as the new KCF window (defined by the predicted location of the object) still resides inside the input window. We thus propose a lazy feed-forward strategy, which is depicted in Figure 5.

Fig. 5. The illustration of the lazy feed-forward strategy. To predict the location of the object (the boy's head), a part of the image (green window) is cropped for generating the network input. Note that the green window is slightly larger than the red block, i.e., the KCF window for predicting the current location. If the predicted location (shown in yellow) still resides inside the green lines, one can reuse the deep features by cropping the corresponding feature maps accordingly.

In this work, we generate the KCF window using the same rules as the HCF tracker [5]; the input window is 10% larger than the KCF window in terms of both width and height. Facilitated by the lazy feed-forward strategy, in the proposed algorithm feed-forward is conducted only once in more than 60% of the video frames. This gives us another 50% speed gain.

4. EXPERIMENT

4.1. Experiment setting

In this section, we report the results of a series of experiments involving the proposed tracker and some state-of-the-art approaches. Our MSDAT method is compared with several well-performing shallow visual trackers, including the KCF tracker [15], TGPR [18], Struck [19], MIL [20], TLD [21] and SCM [22]. Some recently proposed deep trackers, including MD-net [6], HCF [5], GOTURN [7] and the Siamese tracker [8], are also compared. All the experiments are implemented in MATLAB with the MatCaffe [23] deep learning interface, on a computer equipped with an Intel i7 4770K CPU, an NVIDIA GTX 1070 graphics card and 32 GB of RAM.

The code of our algorithm is published on Bitbucket at https://bitbucket.org/xinke_wang/msdat; please refer to the repository for the implementation details.

4.2. Results on OTB-50

Similar to its prototype [24], the Object Tracking Benchmark 50 (OTB-50) [25] consists of 50 video sequences and involves 51 tracking tasks. It has been one of the most popular tracking benchmarks since 2013. The evaluation is based on two metrics: center location error and bounding box overlap ratio. The one-pass evaluation (OPE) is employed to compare our algorithm with HCF [5], GOTURN [7], the Siamese tracker [8] and the aforementioned shallow trackers. The result curves are shown in Figure 6.

From Figure 6 we can see that the proposed MSDAT method beats all the competitors in the overlap evaluation and ranks second in the location error test, with a trivial inferiority (around 1%) to its prototype, the HCF tracker. Recalling that MSDAT beats HCF by a similar margin in the overlap test and runs 3 times faster, one can consider MSDAT a superior variation of HCF that maintains its accuracy at a much higher speed. From the perspective of real-time tracking, our method performs best in both evaluations. To our best knowledge, the proposed MSDAT method is the best-performing real-time tracker on this well-accepted test.

4.3. Results on OTB-100

The Object Tracking Benchmark 100 (OTB-100) is the extension of OTB-50 and contains 100 video sequences. We test our method under the same experimental protocol as OTB-50, comparing with all the aforementioned trackers. The test results are reported in Table 1.
[Figure 6: precision and success plots on OTB-50 (OPE). Precision at the 20-pixel threshold: HCFT (11 fps) 89.07, ours (32 fps) 88.01, DeepTrack (3 fps) 82.60, SiamFC (58 fps) 81.53, TGPR (0.66 fps) 76.61, KCF (245 fps) 74.24, Struck (10 fps) 65.61, SCM (0.37 fps) 64.85, GOTURN (165 fps) 62.51, TLD (22 fps) 60.75, MIL (28 fps) 47.47. Success (AUC): ours 61.41, SiamFC 61.22, HCFT 60.47, DeepTrack 58.92, TGPR 52.94, KCF 51.64, SCM 49.90, Struck 47.37, GOTURN 45.01, TLD 43.75, MIL 35.91.]

Fig. 6. The location error plots and the overlap accuracy plots of the involved trackers, tested on the OTB-50 dataset.
Tracker      | Ours  | HCF   | MD-Net | SiamFC | GOTURN | KCF   | Struck | MIL   | SCM   | TLD
DP rate (%)  | 83.0  | 83.7  | 90.9   | 75.2   | 56.39  | 69.2  | 63.5   | 43.9  | 57.2  | 59.2
OS (AUC)     | 0.567 | 0.562 | 0.678  | 0.561  | 0.424  | 0.475 | 0.459  | 0.331 | 0.445 | 0.424
Speed (FPS)  | 34.8  | 11.0  | 1      | 58     | 165    | 243   | 9.84   | 28.0  | 0.37  | 23.3

Table 1. Tracking accuracies of the compared trackers on OTB-100.
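The two metrics behind these results (the distance-precision rate at the conventional 20-pixel threshold, and the overlap-success score summarized as AUC) can be computed as follows. This is a minimal sketch of the OTB-style evaluation, not the official benchmark toolkit, and the function names are illustrative.

```python
import numpy as np

def center_error(pred, gt):
    """Center location error between two boxes given as (x, y, w, h)."""
    px, py = pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5

def overlap_ratio(pred, gt):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def dp_rate(errors, threshold=20.0):
    """Distance-precision rate: fraction of frames whose center error
    is within the given pixel threshold."""
    return float(np.mean(np.asarray(errors, float) <= threshold))

def success_auc(overlaps, thresholds=np.linspace(0.0, 1.0, 21)):
    """Success-plot AUC: mean success rate over overlap thresholds in [0, 1]."""
    overlaps = np.asarray(overlaps, float)
    return float(np.mean([(overlaps >= t).mean() for t in thresholds]))
```

Per-frame errors and overlaps are accumulated over a sequence (or the whole benchmark) and then summarized by these two scalars, which correspond to the DP rate and OS (AUC) rows of Table 1.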
As can be seen in the table, the proposed MSDAT algorithm keeps its superiority over all the other real-time trackers and keeps an accuracy similar to HCF. The best-performing MD-net (according to our best knowledge) enjoys a remarkable performance gap over all the other trackers while running at around 1 fps.

4.4. The validity of the domain adaptation

To better verify the proposed domain adaptation, here we run another variation of the HCF tracker. For each feature layer (conv3_4, conv4_4, conv5_4) of VGG-19, one randomly selects one eighth of the channels from this layer. In this way, the input channel numbers to KCF are identical to the proposed MSDAT and thus the algorithmic complexities of the "random HCF" and our method are nearly the same. The comparison of MSDAT, HCF and random HCF on OTB-50 is shown in Figure 7.

From the curves one can see a large gap between the randomized HCF and the other two methods. In other words, the proposed domain adaptation not only reduces the channel number, but also extracts the useful features for the tracking task.

5. CONCLUSION AND FUTURE WORK

In this work, we propose a simple yet effective algorithm to transfer the features of the classification domain to the visual tracking domain. The yielded visual tracker, termed MSDAT, is real-time and achieves tracking accuracies comparable to the state-of-the-art deep trackers. The experiments verify the validity of the proposed domain adaptation.

Admittedly, updating the neural network online can lift the tracking accuracy significantly [2, 6]. However, the existing online updating schemes result in a dramatic speed reduction. One possible future direction could be to simultaneously update the KCF model and a certain part of the neural network (e.g. the last convolution layer). In this way, one could strike a balance between accuracy and efficiency and thus obtain a better tracker.

6. REFERENCES

[1] Naiyan Wang and Dit-Yan Yeung, "Learning a deep compact image representation for visual tracking," in NIPS, pp. 809–817, 2013.

[2] Hanxi Li, Yi Li, and Fatih Porikli, "DeepTrack: Learning discriminative feature representations online for robust visual tracking," IEEE Transactions on Image Processing (TIP), vol. 25, no. 4, pp. 1834–1848, 2016.

[3] Anton Milan, Seyed Hamid Rezatofighi, Anthony Dick, Konrad Schindler, and Ian Reid, "Online multi-target tracking using recurrent neural networks," arXiv preprint arXiv:1604.03635, 2016.
[Figure 7: precision and success plots on OTB-50. Precision at the 20-pixel threshold: HCFT 89.07, ours 88.01, random 72.54. Success (AUC): ours 61.41, HCFT 60.47, random 50.68.]

Fig. 7. The location error plots and the overlap accuracy plots of the three versions of the HCF tracker: the original HCF, MSDAT and the random HCF method. Tested on the OTB-50 dataset; better viewed in color.
[4] Guanghan Ning, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang, "Spatially supervised recurrent convolutional neural networks for visual object tracking," arXiv preprint arXiv:1607.05781, 2016.

[5] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang, "Hierarchical convolutional features for visual tracking," in ICCV, 2015, pp. 3074–3082.

[6] Hyeonseob Nam and Bohyung Han, "Learning multi-domain convolutional neural networks for visual tracking," arXiv preprint arXiv:1510.07945, 2015.

[7] David Held, Sebastian Thrun, and Silvio Savarese, "Learning to track at 100 fps with deep regression networks," arXiv preprint arXiv:1604.01802, 2016.

[8] Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, and Philip H. S. Torr, "Fully-convolutional siamese networks for object tracking," in ECCV, 2016, pp. 850–865.

[9] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going deeper with convolutions," in CVPR, 2015, pp. 1–9.

[10] Hanxi Li, Yi Li, and Fatih Porikli, "DeepTrack: Learning discriminative feature representations by convolutional neural networks for visual tracking," in BMVC, 2014.

[11] Naiyan Wang, Siyi Li, Abhinav Gupta, and Dit-Yan Yeung, "Transferring rich feature hierarchies for robust visual tracking," arXiv preprint arXiv:1501.04587, 2015.

[12] Seunghoon Hong, Tackgeun You, Suha Kwak, and Bohyung Han, "Online tracking by learning discriminative saliency map with convolutional neural network," in ICML, 2015, pp. 597–606.

[13] Kaihua Zhang, Qingshan Liu, Yi Wu, and Ming-Hsuan Yang, "Robust tracking via convolutional networks without learning," arXiv preprint arXiv:1501.04505, 2015.

[14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.

[15] João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 37, no. 3, pp. 583–596, 2015.

[16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed, "SSD: Single shot multibox detector," arXiv preprint arXiv:1512.02325, 2015.

[18] Jin Gao, Haibin Ling, Weiming Hu, and Junliang Xing, "Transfer learning based visual tracking with gaussian processes regression," in ECCV, pp. 188–203, 2014.

[19] Sam Hare, Amir Saffari, and Philip H. S. Torr, "Struck: Structured output tracking with kernels," in ICCV, 2011, pp. 263–270.

[20] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie, "Visual tracking with online multiple instance learning," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 1619–1632, 2011.

[21] Zdenek Kalal, Jiri Matas, and Krystian Mikolajczyk, "P-N learning: Bootstrapping binary classifiers by structural constraints," in CVPR, 2010, pp. 49–56.

[22] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang, "Robust object tracking via sparsity-based collaborative model," in CVPR, 2012, pp. 1838–1845.

[23] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM MM, 2014, pp. 675–678.

[24] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang, "Online object tracking: A benchmark," in CVPR, 2013, pp. 2411–2418.

[25] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang, "Object tracking benchmark," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 37, no. 9, pp. 1834–1848, 2015.