Max-Margin Object Detection

Davis E. King
[email protected]

arXiv:1502.00046v1 [cs.CV] 31 Jan 2015

Abstract

Most object detection methods operate by applying a binary classifier to sub-windows of an image, followed by a non-maximum suppression step where detections on overlapping sub-windows are removed. Since the number of possible sub-windows in even moderately sized image datasets is extremely large, the classifier is typically learned from only a subset of the windows. This avoids the computational difficulty of dealing with the entire set of sub-windows; however, as we will show in this paper, it leads to sub-optimal detector performance.

In particular, the main contribution of this paper is the introduction of a new method, Max-Margin Object Detection (MMOD), for learning to detect objects in images. This method does not perform any sub-sampling, but instead optimizes over all sub-windows. MMOD can be used to improve any object detection method which is linear in the learned parameters, such as HOG or bag-of-visual-word models. Using this approach we show substantial performance gains on three publicly available datasets. Strikingly, we show that a single rigid HOG filter can outperform a state-of-the-art deformable part model on the Face Detection Data Set and Benchmark when the HOG filter is learned via MMOD.

1. Introduction

Detecting the presence and position of objects in an image is a fundamental task in computer vision. For example, tracking humans in video or performing scene understanding on a still image requires the ability to reason about the number and position of objects. While great progress has been made in recent years in terms of feature sets, the basic training procedure has remained the same. In this procedure, a set of positive and negative image windows are selected from training images. Then a binary classifier is trained on these windows. Lastly, the classifier is tested on images containing no targets of interest, and false alarm windows are identified and added into the training set. The classifier is then retrained and, optionally, this process is iterated.

This approach does not make efficient use of the available training data since it trains on only a subset of image windows. Additionally, windows partially overlapping an object are a common source of false alarms. This training procedure makes it difficult to directly incorporate these examples into the training set since these windows are neither fully a false alarm nor a true detection. Most importantly, the accuracy of the object detection system as a whole is not optimized. Instead, the accuracy of a binary classifier on the subsampled training set is used as a proxy.

In this work, we show how to address all of these issues. In particular, we will show how to design an optimizer that runs over all windows and optimizes the performance of an object detection system in terms of the number of missed detections and false alarms in the final system output. Moreover, our formulation leads to a convex optimization and we provide an algorithm which finds the globally optimal set of parameters. Finally, we test our method on three publicly available datasets and show that it substantially improves the accuracy of the learned detectors. Strikingly, we find that a single rigid HOG filter can outperform a state-of-the-art deformable part model if the HOG filter is learned via MMOD.

2. Related Work

In their seminal work, Dalal and Triggs introduced the Histogram of Oriented Gradients (HOG) feature for detecting pedestrians within a sliding window framework [3]. Subsequent object detection research has focused primarily on finding improved representations. Many recent approaches include features for part-based modeling, methods for combining local features, or dimensionality reduction [6, 20, 16, 4, 18]. All these methods employ some form of binary classifier trained on positive and negative image windows.

In contrast, Blaschko and Lampert's research into structured output regression is the most similar to our own [2]. As with our approach, they use a structural support vector machine formulation, which allows them to train on all window locations. However, their training procedure assumes an image contains either 0 or 1 objects, while in the present work we show how to treat object detection in the general setting where an image may contain any number of objects.

3. Problem Definition

In what follows, we will use r to denote a rectangular area of an image. Additionally, let R denote the set of all rectangular areas scanned by our object detection system. To incorporate the common non-maximum suppression practice, we define a valid labeling of an image as a subset of R such that each element of the labeling "does not overlap" with each other. We use the following popular definition of "does not overlap": rectangles r_1 and r_2 do not overlap if the ratio of their intersection area to total area covered is less than 0.5. That is,

    Area(r_1 ∩ r_2) / Area(r_1 ∪ r_2) < 0.5.    (1)

Finally, we use Y to denote the set of all valid labelings. Then, given an image x and a window scoring function f, we can define the object detection procedure as

    y* = argmax_{y ∈ Y} Σ_{r ∈ y} f(x, r).    (2)

That is, find the set of sliding window positions which have the largest scores but simultaneously do not overlap. This is typically accomplished with the greedy peak sorting method shown in Algorithm 1. An ideal learning algorithm would find the window scoring function which jointly minimizes the number of false alarms and missed detections produced when used in Algorithm 1.

Algorithm 1 Object Detection
Input: image x, window scoring function f
1: D := all rectangles r ∈ R such that f(x, r) > 0
2: Sort D such that D_1 ≥ D_2 ≥ D_3 ≥ ...
3: y* := {}
4: for i = 1 to |D| do
5:   if D_i does not overlap any rectangle in y* then
6:     y* := y* ∪ {D_i}
7:   end if
8: end for
9: Return: y*, the detected object positions.

It should be noted that solving Equation (2) exactly is not computationally feasible. Thus, this algorithm does not always find the optimal solution to (2). An example which leads to suboptimal results is shown in Figure 1. However, as we will see, this suboptimal behavior does not lead to difficulties. Moreover, in the next section, we give an optimization algorithm capable of finding an appropriate window scoring function for use with Algorithm 1.

Figure 1. Three sliding windows and their f scores. Assume non-max suppression rejects any rectangles which touch. Then the optimal detector would select the two outside rectangles, giving a total score of 12, while a greedy detector selects the center rectangle for a total score of only 7.

4. Max-Margin Object Detection

In this work, we consider only window scoring functions which are linear in their parameters. In particular, we use functions of the form

    f(x, r) = ⟨w, φ(x, r)⟩    (3)

where φ extracts a feature vector from the sliding window location r in image x, and w is a parameter vector. If we denote the sum of window scores for a set of rectangles, y, as F(x, y), then Equation (2) becomes

    y* = argmax_{y ∈ Y} F(x, y) = argmax_{y ∈ Y} Σ_{r ∈ y} ⟨w, φ(x, r)⟩.    (4)

Then we seek a parameter vector w which leads to the fewest possible detection mistakes. That is, given a randomly selected image and label pair (x_i, y_i) ∈ X × Y, we would like the score for the correct labeling of x_i to be larger than the scores for all the incorrect labelings. Therefore,

    F(x_i, y_i) > max_{y ≠ y_i} F(x_i, y)    (5)

should be satisfied as often as possible.

4.1. The Objective Function for Max-Margin Object Detection

Our algorithm takes a set of images {x_1, x_2, ..., x_n} ⊂ X and associated labels {y_1, y_2, ..., y_n} ⊂ Y and attempts to find a w such that the detector makes the correct prediction on each training sample. We take a max-margin approach [10] and require that the label for each training sample is correctly predicted with a large margin. This leads to the following convex optimization problem:

    min_w  (1/2)||w||²    (6)
    s.t.   F(x_i, y_i) ≥ max_{y ∈ Y} [F(x_i, y) + Δ(y, y_i)],  ∀i

where Δ(y, y_i) denotes the loss for predicting a labeling of y when the true labeling is y_i. In particular, we define the loss as

    Δ(y, y_i) = L_miss · (# of missed detections) + L_fa · (# of false alarms)    (7)

where L_miss and L_fa control the relative importance of achieving high recall and high precision, respectively.

Equation (6) is a hard-margin formulation of our learning problem. Since real world data is often noisy, not perfectly separable, or contains outliers, we extend this into the soft-margin setting. In doing so, we arrive at the defining optimization for Max-Margin Object Detection (MMOD):

    min_{w,ξ}  (1/2)||w||² + (C/n) Σ_{i=1}^{n} ξ_i    (8)
    s.t.   F(x_i, y_i) ≥ max_{y ∈ Y} [F(x_i, y) + Δ(y, y_i)] − ξ_i,  ∀i
           ξ_i ≥ 0,  ∀i

In this setting, C is analogous to the usual support vector machine parameter and controls the trade-off between trying to fit the training data or obtain a large margin.

Insight into this formulation can be gained by noting that each ξ_i is an upper bound on the loss incurred by training example (x_i, y_i). This can be seen as follows (let g(x) = argmax_{y ∈ Y} F(x, y)):

    ξ_i ≥ max_{y ∈ Y} [F(x_i, y) + Δ(y, y_i)] − F(x_i, y_i)    (9)
    ξ_i ≥ [F(x_i, g(x_i)) + Δ(g(x_i), y_i)] − F(x_i, y_i)    (10)
    ξ_i ≥ Δ(g(x_i), y_i)    (11)

In the step from (9) to (10) we replace the max over Y with a particular element, g(x_i); therefore, the inequality continues to hold. In going from (10) to (11) we note that F(x_i, g(x_i)) − F(x_i, y_i) ≥ 0 since g(x_i) is by definition the element of Y which maximizes F(x_i, ·).

Therefore, the MMOD objective function defined by Equation (8) is a convex upper bound on the average loss per training image

    (C/n) Σ_{i=1}^{n} Δ(argmax_{y ∈ Y} F(x_i, y), y_i).    (12)

This means that, for example, if ξ_i from Equation (8) is driven to zero then the detector is guaranteed to produce the correct output on the corresponding training example.

This type of max-margin approach has been used successfully in a number of other domains. An example is the Hidden Markov SVM [1], which gives state-of-the-art results on sequence labeling tasks. Other examples include multiclass SVMs and methods for learning probabilistic context free grammars [10].

4.2. Solving the MMOD Optimization Problem

We use the cutting plane method [9, 17] to solve the Max-Margin Object Detection optimization problem defined by Equation (8). Note that MMOD is equivalent to the following unconstrained problem

    min_w  J(w) = (1/2)||w||² + R_emp(w)    (13)

where R_emp(w) is

    (C/n) Σ_{i=1}^{n} max_{y ∈ Y} [F(x_i, y) + Δ(y, y_i) − F(x_i, y_i)].    (14)

Further, note that R_emp is a convex function of w and therefore is lower bounded by any tangent plane. The cutting plane method exploits this to find the minimizer of J. It does this by building a progressively more accurate lower bounding approximation constructed from tangent planes. Each step of the algorithm finds a new w minimizing this approximation. Then it obtains the tangent plane to R_emp at w, and incorporates this new plane into the lower bounding function, tightening the approximation. A sketch of the procedure is shown in Figure 2.

Figure 2. The red curve is lower bounded by its tangent planes. Adding the tangent plane depicted by the green line tightens the lower bound further.

Let ∂R_emp(w_t) denote the subgradient of R_emp at a point w_t. Then a tangent plane to R_emp at w_t is given by

    ⟨w, a⟩ + b    (15)

where

    a ∈ ∂R_emp(w_t)    (16)
    b = R_emp(w_t) − ⟨w_t, a⟩.    (17)

Given these considerations, the lower bounding approximation we use is

    (1/2)||w||² + R_emp(w) ≥ (1/2)||w||² + max_{(a,b) ∈ P} [⟨w, a⟩ + b]    (18)

where P is the set of lower bounding planes, i.e. the "cutting planes".

Algorithm 2 MMOD Optimizer
Input: ε ≥ 0
1: w_0 := 0, t := 0, P := {}
2: repeat
3:   t := t + 1
4:   Compute a plane tangent to R_emp(w_{t−1}): select a_t ∈ ∂R_emp(w_{t−1}) and b_t := R_emp(w_{t−1}) − ⟨w_{t−1}, a_t⟩
5:   P_t := P_{t−1} ∪ {(a_t, b_t)}
6:   Let K_t(w) = (1/2)||w||² + max_{(a_i,b_i) ∈ P_t} [⟨w, a_i⟩ + b_i]
7:   w_t := argmin_w K_t(w)
8: until (1/2)||w_t||² + R_emp(w_t) − K_t(w_t) ≤ ε
9: Return: w_t

Pseudocode for this method is shown in Algorithm 2. It executes until the gap between the true MMOD objective function and the lower bound is less than ε. This guarantees convergence to the optimal w* to within ε. That is, upon termination of Algorithm 2 we will have

    |J(w*) − J(w_t)| < ε.    (19)

4.2.1 Solving the Quadratic Programming Subproblem

A key step of Algorithm 2 is solving the argmin on step 7. This subproblem can be written as a quadratic program and solved efficiently using standard methods. Therefore, in this section we derive a simple quadratic program solver for this problem. We begin by writing step 7 as a quadratic program and obtain

    min_{w,ξ}  (1/2)||w||² + ξ    (20)
    s.t.   ξ ≥ ⟨w, a_i⟩ + b_i,  ∀(a_i, b_i) ∈ P.

The set of variables being optimized, w, will typically have many more dimensions than the number of constraints in the above problem. Therefore, it is more efficient to solve the dual problem. After a little algebra, the dual reduces to the following quadratic program,

    max_λ  λᵀb − (1/2)λᵀQλ    (23)
    s.t.   λ_i ≥ 0,  Σ_{i=1}^{|P|} λ_i = 1

where λ and b are column vectors of the variables λ_i and b_i respectively and Q_ij = ⟨a_i, a_j⟩.

Algorithm 3 Quadratic Program Solver for Equation (23)
Input: Q, b, λ, ε_qp ≥ 0
1: τ := 10⁻¹⁰
2: repeat
3:   ∇ := Qλ − b
4:   big := −∞
5:   little := ∞
6:   l := 0
7:   b := 0
8:   for i = 1 to |∇| do
9:     if ∇_i > big and λ_i > 0 then
10:      big := ∇_i
11:      b := i
12:    end if
13:    if ∇_i < little then
14:      little := ∇_i
15:      l := i
16:    end if
17:  end for
18:  gap := λᵀ∇ − little
19:  z := λ_b + λ_l
20:  x := max(τ, Q_bb + Q_ll − 2Q_bl)
21:  λ_b := λ_b − (big − little)/x
22:  λ_l := λ_l + (big − little)/x
23:  if λ_b < 0 then
24:    λ_b := 0
25:    λ_l := z
26:  end if
27: until gap ≤ ε_qp
28: Return: λ
To derive this dual, note that the Lagrangian of Equation (20) is

    L(w, ξ, λ) = (1/2)||w||² + ξ − Σ_{i=1}^{|P|} λ_i (ξ − ⟨w, a_i⟩ − b_i)    (21)

and so the dual [5] of the quadratic program is

    max_{w,ξ,λ}  L(w, ξ, λ)    (22)
    s.t.   ∇_w L(w, ξ, λ) = 0,
           ∇_ξ L(w, ξ, λ) = 0,
           λ_i ≥ 0,  ∀i.

Eliminating w and ξ using the first two constraints yields the quadratic program of Equation (23).

We use a simplified variant of Platt's sequential minimal optimization method to solve the dual quadratic program of Equation (23) [15]. Algorithm 3 contains the pseudocode. In each iteration, the pair of Lagrange multipliers (λ_b, λ_l) which most violate the KKT conditions is selected (lines 8-17). Then the selected pair is jointly optimized (lines 18-26). The solver terminates when the duality gap is less than a threshold.

Upon solving for the optimal λ*, the w_t needed by step 7 of Algorithm 2 is given by

    w_t = −Σ_{i=1}^{|P|} λ_i* a_i.    (24)

The value of min_w K_t(w) needed for the test for convergence can be conveniently computed as

    λ*ᵀ b − (1/2)||w_t||².    (25)

Additionally, there are a number of non-essential but useful implementation tricks. In particular, the starting λ should be initialized using the λ from the previous iteration of the MMOD optimizer. Also, cutting planes typically become inactive after a small number of iterations and can be safely removed; a cutting plane is inactive if its associated Lagrange multiplier is 0. Our implementation removes a cutting plane if it has been inactive for 20 iterations.
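The pair-update loop of Algorithm 3 translates almost directly into code. The sketch below is our illustration, not the dlib implementation: it assumes Q and b are plain Python lists, adds a `max_iter` safety cap absent from the pseudocode, and checks the duality gap before updating rather than after.

```python
def solve_dual_qp(Q, b, lam, eps_qp=1e-8, tau=1e-10, max_iter=10000):
    """Sketch of Algorithm 3: maximize lambda^T b - 1/2 lambda^T Q lambda
    subject to lambda_i >= 0 and sum(lambda) = 1, via SMO-style pair updates."""
    n = len(b)
    for _ in range(max_iter):
        # Gradient of the negated objective: grad = Q*lambda - b
        grad = [sum(Q[i][j] * lam[j] for j in range(n)) - b[i] for i in range(n)]
        # Select the pair most violating the KKT conditions: 'bi' has the
        # largest gradient among active (positive) multipliers, 'li' the
        # smallest gradient overall.
        big, bi = max((grad[i], i) for i in range(n) if lam[i] > 0)
        little, li = min((grad[i], i) for i in range(n))
        if sum(lam[i] * grad[i] for i in range(n)) - little <= eps_qp:
            break  # duality gap small enough
        # Jointly optimize the selected pair while staying on the simplex.
        z = lam[bi] + lam[li]
        x = max(tau, Q[bi][bi] + Q[li][li] - 2.0 * Q[bi][li])
        step = (big - little) / x
        lam[bi] -= step
        lam[li] += step
        if lam[bi] < 0:  # clip back to the feasible region
            lam[bi] = 0.0
            lam[li] = z
    return lam
```

On the toy problem Q = I, b = (0, 1), starting from the uniform λ = (0.5, 0.5), the solver converges to λ = (0, 1) after a single pair update.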
4.2.2 Computing R_emp and its Subgradient

The final component of our algorithm is a method for computing R_emp and an element of its subgradient. Recall that F(x, y) and R_emp are

    F(x, y) = Σ_{r ∈ y} ⟨w, φ(x, r)⟩    (26)

    R_emp(w) = (C/n) Σ_{i=1}^{n} max_{y ∈ Y} [F(x_i, y) + Δ(y, y_i) − F(x_i, y_i)].    (27)

Then an element of the subgradient of R_emp is

    ∂R_emp(w) = (C/n) Σ_{i=1}^{n} [ Σ_{r ∈ y_i*} φ(x_i, r) − Σ_{r ∈ y_i} φ(x_i, r) ]    (28)

where

    y_i* = argmax_{y ∈ Y} [ Δ(y, y_i) + Σ_{r ∈ y} ⟨w, φ(x_i, r)⟩ ].    (29)

Our method for computing y_i* is shown in Algorithm 4. It is a modification of the normal object detection procedure from Algorithm 1 to solve Equation (29) rather than (2). Therefore, the task of Algorithm 4 is to find the set of rectangles which jointly maximize the total detection score and loss.

Algorithm 4 Loss Augmented Detection
Input: image x, true object positions y, weight vector w, L_miss, L_fa
1: D := all rectangles r ∈ R such that ⟨w, φ(x, r)⟩ + L_fa > 0
2: Sort D such that D_1 ≥ D_2 ≥ D_3 ≥ ...
3: s_r := 0, h_r := false, ∀r ∈ y
4: for i = 1 to |D| do
5:   if D_i does not overlap {D_{i−1}, D_{i−2}, ...} then
6:     if D_i matches an element of y then
7:       r := best matching element of y
8:       if h_r = false then
9:         s_r := ⟨w, φ(x, D_i)⟩
10:        h_r := true
11:      else
12:        s_r := s_r + ⟨w, φ(x, D_i)⟩ + L_fa
13:      end if
14:    end if
15:  end if
16: end for
17: y* := {}
18: for i = 1 to |D| do
19:   if D_i does not overlap y* then
20:     if D_i matches an element of y then
21:       r := best matching element of y
22:       if s_r > L_miss then
23:         y* := y* ∪ {D_i}
24:       end if
25:     else
26:       y* := y* ∪ {D_i}
27:     end if
28:   end if
29: end for
30: Return: y*, the detected object positions.

There are two cases to consider. First, if a rectangle does not hit any truth rectangles, then it contributes positively to the argmax in Equation (29) whenever its score plus the loss per false alarm (L_fa) is positive. Second, if a rectangle hits a truth rectangle then we reason as follows: if we reject the first rectangle which matches a truth rectangle then, since the rectangles are sorted in descending order of score, we will reject all others which match it as well. This outcome results in a single loss of L_miss. Alternatively, if we accept the first rectangle which matches a truth rectangle then we gain its detection score. Additionally, we may also obtain further scores from subsequent duplicate detections, each of which contributes the value of its window scoring function plus L_fa. Therefore, Algorithm 4 computes the total score for the accept case and checks it against L_miss. It then selects the result with the largest value. In the pseudocode, these scores are accumulated in the s_r variables.

This algorithm is greedy and thus may fail to find the optimal y_i* according to Equation (29). However, it is greedy in much the same way as the detection method of Algorithm 1. Moreover, since our goal from the outset is to find a set of parameters which makes Algorithm 1 perform well, we should use a training procedure which respects the properties of the algorithm being optimized. For example, if the correct output in the case of Figure 1 was to select the two boxes on the sides, then Algorithm 1 would make a mistake while a method which was optimal would not. Therefore, it is important for the learning procedure to account for this and learn that, in such situations, if Algorithm 1 is to produce the correct output, the side rectangles need a larger score than the middle rectangle. Ultimately, it is only necessary for Algorithm 4 to give a value of R_emp which upper bounds the average loss per training image. In our experiments, we always observed this to be the case.

5. Experimental Results

To test the effectiveness of MMOD, we evaluate it on the TU Darmstadt cows [14], INRIA pedestrians [3], and FDDB [8] datasets. When evaluating on the first two datasets, we use the same feature extraction (φ) and parameter settings (C, ε, ε_qp, L_fa, and L_miss), which are set as follows: C = 25, ε = 0.15C, L_fa = 1, and L_miss = 2. This value of ε means the optimization runs until the potential improvement in average loss per training example is less than 0.15. For the QP subproblem, we set ε_qp = min(0.01, 0.1 · ((1/2)||w_t||² + R_emp(w_t) − K_t(w_t))) to allow the accuracy with which we solve the subproblem to vary as the overall optimization progresses.

For feature extraction, we use the popular spatial pyramid bag-of-visual-words model [13]. In our implementation, each window is divided into a 6x6 grid. Within each grid location we extract a 2,048 bin histogram of visual words. The visual-word histograms are computed by extracting 36-dimensional HOG [3] descriptors from each pixel location, determining which histogram bin each descriptor is closest to, and adding 1 to that visual word's bin count. Next, the visual-word histograms are concatenated to form the feature vector for the sliding window. Finally, we add a constant term which serves as a threshold for detection. Therefore, φ produces 73,729 dimensional feature vectors.

The local HOG descriptors are 36 dimensional and are extracted from 10x10 grayscale pixel blocks of four 5x5 pixel cells. Each cell contains 9 unsigned orientation bins. Bilinear interpolation is used for assigning votes to orientation bins but not for spatial bins.

To determine which visual word a HOG descriptor corresponds to, many researchers compute its distance to an exemplar for each bin and assign the vector to the nearest bin. However, this is computationally expensive, so we use a fast approximate method to determine bin assignment. In particular, we use a random projection based locality sensitive hash [7], accomplished using 11 random planes. A HOG vector is hashed by recording the bit pattern describing which side of each plane it falls on. This 11-bit number then indicates the visual word's bin.

Finally, the sliding window classification can be implemented efficiently using a set of integral images. We also scan the sliding window over every location in an image pyramid which downsamples each layer by 4/5. To decide if two detections overlap for the purposes of non-max suppression we use Equation (1). Similarly, we use Equation (1) to determine if a detection hits a truth box. Finally, all experiments were run on a single desktop workstation.

5.1. TU Darmstadt Cows

We performed 10-fold cross-validation on the TU Darmstadt cows [14] dataset and obtained perfect detection results with no false alarms. The best previous results on this dataset achieve an accuracy of 98.2% at equal error rate [2]. The dataset contains 112 images, each containing a side view of a cow. In this test the sliding window was 174 pixels wide and 90 tall. Training on the entire cows dataset finishes in 49 iterations and takes 70 seconds.

5.2. INRIA Pedestrians

We also tested MMOD on the INRIA pedestrian dataset and followed the testing methodology used by Dalal and Triggs [3]. This dataset has 2,416 cropped images of people for training as well as 912 negative images. For testing it has 1,132 people images and 300 negative images.

Figure 3. The y axis measures the miss rate on people images while the x axis shows FPPW obtained when scanning the detector over negative images. Our method improves both miss rate and false positives per window compared to previous methods on the INRIA dataset.
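The random-projection hashing used above for visual-word assignment can be sketched as follows. This is our illustration, not the paper's code: the Gaussian sampling of the planes and the fixed seed are assumptions, since the paper only specifies that 11 random planes produce an 11-bit bin index.

```python
import random

def make_lsh_planes(dim=36, n_planes=11, seed=0):
    """Draw the random hyperplanes once; 11 planes give 2**11 = 2048 bins."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def visual_word_bin(descriptor, planes):
    """Hash a HOG descriptor to a visual-word bin by recording the bit
    pattern of which side of each random plane it falls on."""
    code = 0
    for plane in planes:
        dot = sum(d * p for d, p in zip(descriptor, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code  # an integer in [0, 2**len(planes))
```

Because only the sign of each projection matters, the bin assignment is invariant to positive rescaling of the descriptor, which is why no exemplar distances need to be computed.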
The negative testing images have an average of 199,834 pixels per image. We scan our detector over an image pyramid which downsamples at a rate of 4/5 and stop when the smallest pyramid layer contains 17,000 pixels. Therefore, MMOD scans approximately 930,000 windows per negative image on the testing data.

We use a sliding window 56 pixels wide and 120 tall. The entire optimization takes 116 minutes and runs for 220 iterations. Our results are compared to previous methods in Figure 3. The detection tradeoff curve shows that our method achieves superior performance even though we use a basic bag-of-visual-words feature set while more recent work has invested heavily in improved feature representations. Therefore, we attribute the improvement to our training procedure.

5.3. FDDB

Finally, we evaluate our method on the Face Detection Data Set and Benchmark (FDDB) challenge. This challenging dataset contains images of human faces in multiple poses captured in indoor and outdoor settings. To test MMOD, we used it to learn a basic HOG sliding window classifier. Therefore the feature extractor (φ) takes in a window and outputs a HOG vector describing the entire window, as was done in Dalal and Triggs's seminal paper [3]. To illustrate the learned model, the HOG filter resulting from the first FDDB fold is visualized in Figure 4.

Figure 4. The HOG filter learned via MMOD from the first fold of the FDDB dataset. The filters from other folds look nearly identical.

During learning, the parameters were set as follows: C = 50, ε = 0.01C, L_fa = 1, and L_miss = 1. We also upsampled each image by a factor of two so that smaller faces could be detected. Since our HOG filter box is 80x80 pixels in size, this upsampling allows us to detect faces that are larger than about 40x40 pixels in size. Additionally, we mirrored the dataset, effectively doubling the number of training images. This leads to training sizes of about 5,000 images per fold and our optimizer requires approximately 25 minutes per fold.

To perform detection, this single HOG filter is scanned over the image at each level of an image pyramid and any windows which pass a threshold test are output after non-max suppression is performed. A ROC curve comparing this learned HOG filter against other methods is created by sweeping this threshold and can be seen in Figure 5. To create the ROC curve we followed the FDDB evaluation protocol of performing 10-fold cross-validation and combining the results into a single ROC curve using the provided FDDB evaluation software. Example images with detection outputs are also shown in Figure 6.

In Figure 5 we see that the HOG filter learned via MMOD substantially outperforms a HOG filter learned with the typical linear SVM "hard negative mining" approach [12] as well as the classic Viola-Jones method [19]. Moreover, our single HOG filter learned via MMOD gives slightly better accuracy than the complex deformable part model of Zhu and Ramanan [22].

Figure 5. A comparison between our HOG filter learned via MMOD and three other techniques, including another HOG filter method learned using traditional means. The MMOD procedure results in a much more accurate HOG filter.

6. Conclusion

We introduced a new method for learning to detect objects in images. In particular, our method leads to a convex optimization and we provided an efficient algorithm for its solution. We tested our approach on three publicly available datasets, the INRIA person dataset, TU Darmstadt cows, and FDDB, using two feature representations. On all datasets, using MMOD to find the parameters of the detector led to substantial improvements.

Our results on FDDB are most striking, as we showed that a single rigid HOG filter can beat a state-of-the-art deformable part model when the HOG filter is learned via MMOD. We attribute our success to the learning method's ability to make full use of the data.
In particular, on FDDB, our method can efficiently make use of all 300 million sliding window positions during training. Moreover, MMOD optimizes the overall accuracy of the entire detector, taking into account information which is typically ignored when training a detector. This includes windows which partially overlap target windows as well as the non-maximum suppression strategy used in the final detector.

Our method currently uses a linear window scoring function. Future research will focus on extending this method to use more complex scoring functions, possibly by using kernels. The work of Yu and Joachims is a good starting point [21]. Additionally, while our approach was introduced for 2D sliding window problems, it may also be useful for 1D sliding window detection applications, such as those appearing in the speech and natural language processing domains. Finally, to encourage future research, we have made a careful and thoroughly documented implementation of our method available as part of the open source dlib machine learning toolbox [11].¹

Figure 6. Example images from the FDDB dataset. The red boxes show the detections from HOG filters learned using MMOD. The HOG filters were not trained on the images shown.

References

[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In International Conference on Machine Learning (ICML), 2003.
[2] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In European Conference on Computer Vision (ECCV), 2008.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), pages 886-893, 2005.
[4] G. Duan, C. Huang, H. Ai, and S. Lao. Boosting associated pairing comparison features for pedestrian detection. In Ninth IEEE International Workshop on Visual Surveillance, 2009.
[5] R. Fletcher. Practical Methods of Optimization, Second Edition. John Wiley & Sons, 1987.
[6] C. Huang and R. Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. In Computer Vision and Pattern Recognition (CVPR), 2010.
[7] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In 30th Ann. ACM Symp. on Theory of Computing, 1998.
[8] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[9] T. Joachims, T. Finley, and C. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.
[10] T. Joachims, T. Hofmann, Y. Yue, and C. J. Yu. Predicting structured objects with support vector machines. Communications of the ACM, Research Highlight, 52(11):97-104, November 2009.
[11] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755-1758, 2009.
[12] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof. Robust face detection by simple means. In DAGM 2012 CVAW Workshop.
[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition (CVPR), 2006.
[14] D. R. Magee and R. D. Boyle. Detecting lameness using 're-sampling condensation' and 'multi-stream cyclic hidden Markov models'. Image and Vision Computing, 20, 2002.
[15] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning, pages 185-208, Cambridge, MA, 1998. MIT Press.
[16] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis. Human detection using partial least squares analysis. In International Conference on Computer Vision (ICCV), 2009.
[17] C. H. Teo, S. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311-365, March 2010.
[18] O. Tuzel, F. Porikli, and P. Meer. Human detection via classification on Riemannian manifolds. In Computer Vision and Pattern Recognition (CVPR), 2007.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition (CVPR), 2001.
[20] B. Wu and R. Nevatia. Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses. In Computer Vision and Pattern Recognition (CVPR), 2008.
[21] C. J. Yu and T. Joachims. Training structural SVMs with kernels using sampled cuts. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2008.
[22] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012.

¹ Software available at http://dlib.net/ml.html#structural_object_detection_trainer
