Appearing in Int. J. Comput. Vis.; content may change prior to final publication.

Training Effective Node Classifiers for Cascade Classification

Chunhua Shen · Peng Wang · Sakrapee Paisitkriangkrai · Anton van den Hengel

December 2012

arXiv:1301.2032v1 [cs.CV] 10 Jan 2013

Abstract Cascade classifiers are widely used in real-time object detection. Unlike conventional classifiers, which are designed for a low overall classification error rate, the classifier in each node of a cascade is required to achieve an extremely high detection rate and a moderate false positive rate. Although a few reported methods address this requirement in the context of object detection, there is no principled feature selection method that explicitly takes this asymmetric node learning objective into account. We provide such an algorithm here. We show that a special case of the biased minimax probability machine has the same formulation as the linear asymmetric classifier (LAC) of Wu et al. (2005). We then design a new boosting algorithm that directly optimizes the cost function of LAC. The resulting totally-corrective boosting algorithm is implemented by the column generation technique in convex optimization. Experimental results on object detection verify the effectiveness of the proposed boosting algorithm as a node classifier in cascade object detection, and show performance better than that of the current state-of-the-art.

Keywords AdaBoost · Minimax Probability Machine · Cascade Classifier · Object Detection · Human Detection

C. Shen · P. Wang · S. Paisitkriangkrai · A. van den Hengel
Australian Centre for Visual Technologies, and School of Computer Science, The University of Adelaide, SA 5005, Australia
E-mail: [email protected]
This work was in part supported by Australian Research Council Future Fellowship FT120100969.

1 Introduction

Real-time object detection inherently involves searching a large number of candidate image regions for a small number of objects. Processing a single image, for example, can require the interrogation of well over a million scanned windows in order to uncover a single correct detection. This imbalance in the data has an impact on the way that detectors are applied, but also on the training process. This impact is reflected in the need to identify discriminative features from within a large over-complete feature set.

Cascade classifiers have been proposed as a potential solution to the problem of imbalance in the data (Viola and Jones 2004; Bi et al. 2006; Dundar and Bi 2007; Brubaker et al. 2008; Wu et al. 2008), and have received significant attention due to their speed and accuracy. In this work, we propose a principled method by which to train a boosting-based cascade of classifiers.

The boosting-based cascade approach to object detection was introduced by Viola and Jones (Viola and Jones 2004; 2002), and has received significant subsequent attention (Li and Zhang 2004; Pham and Cham 2007b; Pham et al. 2008; Paisitkriangkrai et al. 2008; Shen et al. 2008; Paisitkriangkrai et al. 2009). It also underpins the current state-of-the-art (Wu et al. 2005; 2008). The Viola and Jones approach uses a cascade of increasingly complex classifiers, each of which aims to achieve the best possible classification accuracy while achieving an extremely low false negative rate. These classifiers can be seen as forming the nodes of a degenerate binary tree (see Fig. 1) whereby a negative result from any single such node classifier terminates the interrogation of the current patch. Viola and Jones use AdaBoost to train each node classifier in order to achieve the best possible classification accuracy. A low false negative rate is achieved by subsequently adjusting the decision threshold until the desired false negative rate is achieved. This process cannot be guaranteed to produce the best detection performance for a given false negative rate.

Under the assumption that each node of the cascade classifier makes independent classification errors, the detection rate and false positive rate of the entire cascade are $F_{\mathrm{dr}} = \prod_{t=1}^{N} d_t$ and $F_{\mathrm{fp}} = \prod_{t=1}^{N} f_t$, respectively, where $d_t$ represents the detection rate of classifier $t$, $f_t$ the corresponding false positive rate and $N$ the number of nodes. As pointed out in (Viola and Jones 2004; Wu et al. 2005), these two equations suggest a node learning objective: each node should have an extremely high detection rate $d_t$ (e.g., 99.7%) and a moderate false positive rate $f_t$ (e.g., 50%). With the above values of $d_t$ and $f_t$, and a cascade of $N = 20$ nodes, $F_{\mathrm{dr}} = 0.997^{20} \approx 94\%$ and $F_{\mathrm{fp}} = 0.5^{20} \approx 10^{-6}$, which is a typical design goal.

One drawback of the standard AdaBoost approach to boosting is that it does not take advantage of the cascade classifier's special structure. AdaBoost only minimizes the overall classification error and does not particularly minimize the number of false negatives. In this sense, the features selected by AdaBoost are not optimal for the purpose of rejecting as many negative examples as possible. Viola and Jones proposed a solution to this problem in AsymBoost (Viola and Jones 2002) (and its variants (Pham and Cham 2007b; Pham et al. 2008; Wang et al. 2012; Masnadi-Shirazi and Vasconcelos 2007)) by modifying the loss function so as to penalize false negatives more heavily. AsymBoost achieves better detection rates than AdaBoost, but still addresses the node learning goal indirectly, and cannot be guaranteed to achieve the optimal solution.

Wu et al. explicitly studied the node learning goal and proposed to use the linear asymmetric classifier (LAC) and Fisher linear discriminant analysis (LDA) to adjust the weights on a set of features selected by AdaBoost or AsymBoost (Wu et al. 2005; 2008). Their experiments indicated that with this post-processing technique the node learning objective can be better met, which translates into improved detection rates. In Viola and Jones' framework, boosting is used to select features and at the same time to train a strong classifier. Wu et al.'s work separates these two tasks: AdaBoost or AsymBoost is used to select features; and as a second step, LAC or LDA is used to construct a strong classifier by adjusting the weights of the selected features. The node learning objective is only considered at the second step. At the first step (feature selection) the node learning objective is not explicitly considered at all. We conjecture that further improvement may be gained if the node learning objective is explicitly taken into account at both steps. We thus propose new boosting algorithms to implement this idea and verify this conjecture. A preliminary version of this work was published in Shen et al. (2010).

Our major contributions are as follows.

1. Starting from the theory of minimax probability machines (MPMs), we derive a simplified version of the biased minimax probability machine, which has the same formulation as the linear asymmetric classifier of Wu et al. (2005). We thus show the underlying connection between MPM and LAC. Importantly, this new interpretation weakens some of the restrictions on the acceptable input data distribution imposed by LAC.
2. We develop new boosting-like algorithms by directly minimizing the objective function of the linear asymmetric classifier, which results in an algorithm that we label LACBoost. We also propose FisherBoost on the basis of Fisher LDA rather than LAC. Both methods may be used to identify the feature set that optimally achieves the node learning goal when training a cascade classifier. To our knowledge, this is the first attempt to design such a feature selection method.
3. LACBoost and FisherBoost share similarities with LPBoost (Demiriz et al. 2002) in the sense that both use column generation, a technique originally proposed for large-scale linear programming (LP). Typically, the Lagrange dual problem is solved at each iteration in column generation. We instead solve the primal quadratic programming (QP) problem, which has a special structure, so that entropic gradient (EG) can be used to solve it very efficiently. Compared with general interior-point based QP solvers, EG is much faster.
4. We apply LACBoost and FisherBoost to object detection and observe better performance than other methods (Wu et al. 2005; 2008; Maji et al. 2008). In particular on pedestrian detection, FisherBoost achieves the state-of-the-art, compared with the methods listed in (Dollár et al. 2012) on three benchmark datasets. The results confirm our conjecture and show the effectiveness of LACBoost and FisherBoost. These methods can be immediately applied to other asymmetric classification problems.

Moreover, we analyze the condition under which LAC is valid, and show that the multi-exit cascade might be more suitable for applying the LAC learning of Wu et al. (2005) and Wu et al. (2008) (and our LACBoost) than Viola-Jones' conventional cascade.

As observed in Wu et al. (2008), in many cases LDA even performs better than LAC. In our experiments, we have also observed similar phenomena. Paisitkriangkrai et al. (2009) empirically showed that LDA's criterion can be used to achieve better detection results. An explanation of why LDA works so well for object detection is missing in the literature. Here we demonstrate that in the context of object detection, LDA can be seen as a regularized version of LAC in approximation.

The proposed LACBoost/FisherBoost algorithm differs from traditional boosting algorithms in that it does not minimize a loss function. This opens new possibilities for designing boosting-like algorithms for special purposes. We have also extended column generation for optimizing nonlinear optimization problems. Next we review related work in the context of real-time object detection using cascade classifiers.

1.1 Related Work

The field of object detection has made significant progress over the last decade, especially after the seminal work of Viola and Jones. Three key components that contribute to their first robust real-time object detection framework are:

1. The cascade classifier, which efficiently filters out negative patches in early nodes while maintaining a very high detection rate;
2. AdaBoost, which selects informative features and at the same time trains a strong classifier;
3. The use of integral images, which makes the computation of Haar features extremely fast.

This approach has received significant subsequent attention. A number of alternative cascades have been developed, including the soft cascade (Bourdev and Brandt 2005), WaldBoost (Sochman and Matas 2005), the dynamic cascade (Xiao et al. 2007), the AND-OR cascade (Dundar and Bi 2007), the multi-exit cascade (Pham et al. 2008), the joint cascade (Lefakis and Fleuret 2010) and, more recently, the rate constraint embedded cascade (RCECBoost) (Saberian and Vasconcelos 2012). In this work we have adopted the multi-exit cascade of Pham et al. due to its effectiveness and efficiency as demonstrated in Pham et al. (2008). The multi-exit cascade improves classification performance by using the results of all of the weak classifiers applied to a patch so far in reaching a decision at each node of the tree (see Fig. 1). Thus the $n$-th node classifier uses the results of the weak classifiers associated with node $n$, but also those associated with the previous $n-1$ node classifiers in the cascade. We show below that LAC post-processing can enhance the multi-exit cascade, and that the multi-exit cascade more accurately fulfills the LAC requirement that the margin be drawn from a Gaussian distribution.

In addition to improving the cascade structure, a number of improvements have been made to the learning algorithm for building node classifiers in a cascade. Wu et al., for example, use fast forward feature selection to accelerate the training procedure (Wu et al. 2003). Wu et al. (2005) also showed that LAC may be used to deliver better classification performance. Pham and Cham proposed online asymmetric boosting, which considerably reduces the training time required (Pham and Cham 2007b). By exploiting the feature statistics, Pham and Cham (2007a) have also designed a fast method to train weak classifiers. Li and Zhang (2004) proposed FloatBoost, which discards redundant weak classifiers during AdaBoost's greedy selection procedure. Masnadi-Shirazi and Vasconcelos (2011) proposed cost-sensitive boosting algorithms which can be applied to different cost-sensitive losses by means of gradient descent. Liu and Shum (2003) also proposed KLBoost, aiming to select features that maximize the projected Kullback-Leibler divergence and to select feature weights by minimizing the classification error. Promising results have also been reported by LogitBoost (Tuzel et al. 2008), which employs the logistic regression loss, and GentleBoost (Torralba et al. 2007), which uses adaptive Newton steps to fit the additive model. Multi-instance boosting has been introduced to object detection (Viola et al. 2005; Dollár et al. 2008; Lin et al. 2009); it does not require precisely labeled locations of the targets in the training data.

New features have also been designed for improving the detection performance. Viola and Jones' Haar features are not sufficiently discriminative for detecting more complex objects like pedestrians, or multi-view faces. Covariance features (Tuzel et al. 2008) and histogram of oriented gradients (HOG) (Dalal and Triggs 2005) have been proposed in this context, and efficient implementation approaches (along the lines of integral images) have been developed for each. Shape context, which can also exploit integral images (Aldavert et al. 2010), was applied to human detection in thermal images (Wang et al. 2010). The local binary pattern (LBP) descriptor and its variants have shown promising performance on human detection (Mu et al. 2008; Zheng et al. 2010). Recently, effort has been spent on combining complementary features, including: simple concatenation of HOG and LBP (Wang et al. 2007), combination of heterogeneous local features in a boosted cascade classifier (Wu and Nevatia 2008), and Bayesian integration of intensity, depth and motion features in a mixture-of-experts model (Enzweiler et al. 2010).

[Figure 1: two cascade diagrams. In each, the input passes through node classifiers 1, 2, ..., N built from the weak classifiers $h_1, h_2, \cdots, h_j, h_{j+1}, \cdots, h_{n-1}, h_n$; a true (T) result at a node passes the patch on, ending at "target", and a false (F) result rejects it.]

Fig. 1: Cascade classifiers. The first one is the standard cascade of Viola and Jones (2004). The second one is the multi-exit cascade proposed in Pham et al. (2008). Only those classified as true detection by all nodes will be true targets.

The rest of the paper is organized as follows. We briefly review the concept of the minimax probability machine and derive the new simplified version of the biased minimax probability machine in Section 2. Linear asymmetric classification and its connection to the minimax probability machine are discussed in Section 3. In Section 4, we show how to design new boosting algorithms (LACBoost and FisherBoost) by rewriting the optimization formulations of LAC and Fisher LDA. The new boosting algorithms are applied to object detection in Section 5 and we conclude the paper in Section 6.

1.2 Notation

The following notation is used. A matrix is denoted by a bold upper-case letter ($\mathbf{X}$); a column vector is denoted by a bold lower-case letter ($\mathbf{x}$). The $i$th row of $\mathbf{X}$ is denoted by $\mathbf{X}_{i:}$ and the $i$th column by $\mathbf{X}_{:i}$. The identity matrix is $\mathbf{I}$ and its size should be clear from the context. $\mathbf{1}$ and $\mathbf{0}$ are column vectors of 1's and 0's, respectively. We use $\succcurlyeq, \preccurlyeq$ to denote component-wise inequalities.

Let $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1,\cdots,m}$ be the set of training data, where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \{-1,+1\}$, $\forall i$. The training set consists of $m_1$ positive training points and $m_2$ negative ones; $m_1 + m_2 = m$. Let $h(\cdot) \in \mathcal{H}$ be a weak classifier that projects an input vector $\mathbf{x}$ into $\{-1,+1\}$. Note that here we consider only classifiers with discrete outputs, although the developed methods can use real-valued weak classifiers too. We assume that $\mathcal{H}$, the set from which $h(\cdot)$ is selected, is finite and has $n$ elements.

Define the matrix $\mathbf{H}^{\mathcal{Z}} \in \mathbb{R}^{m \times n}$ such that the $(i,j)$ entry $\mathbf{H}^{\mathcal{Z}}_{ij} = h_j(\mathbf{x}_i)$ is the label predicted by weak classifier $h_j(\cdot)$ for the datum $\mathbf{x}_i$, where $\mathbf{x}_i$ is the $i$th element of the set $\mathcal{Z}$. In order to simplify the notation we eliminate the superscript when $\mathcal{Z}$ is the training set, so $\mathbf{H}^{\mathcal{Z}} = \mathbf{H}$. Therefore, each column $\mathbf{H}_{:j}$ of the matrix $\mathbf{H}$ consists of the output of weak classifier $h_j(\cdot)$ on all the training data, while each row $\mathbf{H}_{i:}$ contains the outputs of all weak classifiers on the training datum $\mathbf{x}_i$. Define similarly the matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ such that $\mathbf{A}_{ij} = y_i h_j(\mathbf{x}_i)$. Note that boosting algorithms depend entirely on the matrix $\mathbf{A}$ and do not directly interact with the training examples. Our following discussion will thus largely focus on the matrix $\mathbf{A}$. We write the vector obtained by multiplying a matrix $\mathbf{A}$ with a vector $\mathbf{w}$ as $\mathbf{A}\mathbf{w}$ and its $i$th entry as $(\mathbf{A}\mathbf{w})_i$. If we let $\mathbf{w}$ represent the coefficients of the selected weak classifiers, then the margin of the training datum $\mathbf{x}_i$ is $\rho_i = \mathbf{A}_{i:}\mathbf{w} = (\mathbf{A}\mathbf{w})_i$ and the vector of such margins for all of the training data is $\boldsymbol{\rho} = \mathbf{A}\mathbf{w}$.

2 Minimax Probability Machines

Before we introduce our boosting algorithm, let us first briefly review the concept of minimax probability machines (MPM) (Lanckriet et al. 2002).

2.1 Minimax Probability Classifiers

Let $\mathbf{x}_1 \in \mathbb{R}^n$ and $\mathbf{x}_2 \in \mathbb{R}^n$ denote two random vectors
drawn from two distributions with means and covari- Training EffectiveNodeClassifiersforCascadeClassification 5 ances(µ ,Σ )and(µ ,Σ ),respectively.Hereµ ,µ ∈ 2.3 Simplified Biased Minimax Probability Machines 1 1 2 2 1 2 RnandΣ ,Σ ∈Rn×n.Wedefinetheclasslabelsofx 1 2 1 andx as+1and−1,w.l.o.g.Theminimaxprobability Inthissection,weareinterestedinsimplifyingtheprob- 2 machine (MPM) seeks a robust separation hyperplane lem of (2) for a special case of γ = 0.5, due to its ◦ thatcanseparatethetwoclassesofdatawiththemax- important application in object detection (Viola and imal probability. The hyperplane can be expressed as Jones 2004; Wu et al. 2005). In the following discus- w(cid:62)x=b with w ∈Rn\{0} and b∈R. The problem of sion, for simplicity, we only consider γ =0.5 although ◦ identifyingtheoptimalhyperplanemaythenbeformu- some algorithms developed may also apply to γ <0.5. ◦ lated as Theoreticalresultsin(Yuetal.2009)showthat,the worst-case constraint in (2) can be written in different (cid:20) (cid:21) forms when x follows arbitrary, symmetric, symmetric max γ s.t. inf Pr{w(cid:62)x ≥b} ≥γ, (1) w,b,γ x1∼(µ1,Σ1) 1 unimodal or Gaussian distributions (see Appendix A). (cid:20) (cid:21) Both the MPM (Lanckriet et al. 2002) and the biased inf Pr{w(cid:62)x ≤b} ≥γ. 2 MPM (Huang et al. 2004) are based the most general x2∼(µ2,Σ2) formofthefourcasesshowninAppendixA,i.e.,Equa- tion (27) for arbitrary distributions, as they do not im- Hereγ isthelowerboundoftheclassificationaccuracy pose constraints upon the distributions of x and x . (ortheworst-caseaccuracy)ontestdata.Thisproblem 1 2 However, one may take advantage of structural in- canbetransformedintoaconvexproblem,morespecif- formation whenever available. For example, it is shown ically a second-order cone program (SOCP) (Boyd and in(Wuetal.2005)that,forthefacedetectionproblem, Vandenberghe 2004) and thus can be solved efficiently weakclassifieroutputscanbewellapproximatedbythe (Lanckriet et al. 2002). 
Gaussiandistribution.Inotherwords,theconstraintfor arbitrary distributions does not utilize any type of a prioriinformation,andhence,formanyproblems,con- 2.2 Biased Minimax Probability Machines sidering arbitrary distributions for simplifying (1) and (2)istooconservative.SinceboththeMPM(Lanckriet Theformulation(1)assumesthattheclassificationprob- etal.2002)andthebiasedMPM(Huangetal.2004)do lem is balanced. It attempts to achieve a high recog- not assume any constraints on the distribution family, nition accuracy, which assumes that the losses associ- they fail to exploit this structural information. ated with all mis-classifications are identical. However, Letusconsiderthespecialcaseofγ =0.5.Itiseasy ◦ in many applications this is not the case. to see that the worst-case constraint in (2) becomes a Huang et al. (2004) proposed a biased version of simple linear constraint for symmetric, symmetric uni- MPM through a slight modification of (1), which may modal,aswellasGaussiandistributions(seeAppendix be formulated as A). As pointed out in (Yu et al. 2009), such a result istheimmediateconsequenceofsymmetrybecausethe (cid:20) (cid:21) max γ s.t. inf Pr{w(cid:62)x ≥b} ≥γ, (2) worst-case distributions are forced to put probability 1 w,b,γ x1∼(µ1,Σ1) mass arbitrarily far away on both sides of the mean. (cid:20) (cid:21) In such case, any information about the covariance is inf Pr{w(cid:62)x ≤b} ≥γ . 2 ◦ neglected. x2∼(µ2,Σ2) We now apply this result to the biased MPM as Here γ ∈ (0,1) is a prescribed constant, which is the represented by (2). Our main result is the following ◦ acceptableclassificationaccuracyforthelessimportant theorem. class. The resulting decision hyperplane prioritizes the Theorem 1 With γ = 0.5, the biased minimax prob- classification of the important class x over that of the ◦ 1 lem (2)canbeformulatedasanunconstrainedproblem: less important class x . Biased MPM is thus expected 2 to perform better in biased classification applications. 
Huangetal.showedthat(2)canbeiterativelysolved w(cid:62)(µ −µ ) max 1 2 , (3) via solving a sequence of SOCPs using the fractional (cid:112) w w(cid:62)Σ w 1 programming(FP)technique.Clearlyitissignificantly morecomputationallydemandingtosolve(2)than(1). under the assumption that x follows a symmetric dis- 2 Nextweshowhowtore-formulate(2)intoasimpler tribution. The optimal b can be obtained through: quadraticprogram(QP)basedontherecenttheoretical results in (Yu et al. 2009). b=w(cid:62)µ . (4) 2 6 ChunhuaShenetal. Theworst-caseclassificationaccuracyforthefirstclass, Weseeka{w,b}pairwithaveryhighaccuracyonthe γ(cid:63), is obtained by solving positive data x and a moderate accuracy on the nega- 1 tivex .Thiscanbeexpressedasthefollowingproblem: −b(cid:63)+a(cid:63)(cid:62)µ 2 ϕ(γ(cid:63))= 1, (5) (cid:112)w(cid:63)(cid:62)Σ1w(cid:63) wm(cid:54)=a0x,b x1∼(Pµr1,Σ1){w(cid:62)x1 ≥b}, where s.t. Pr {w(cid:62)x ≤b}=λ. (7) 2 (cid:113) γ if x ∼(µ ,Σ ), x2∼(µ2,Σ2) ϕ(γ)=(cid:113)12−(1γ1−γ) if x11 ∼(µ11,Σ11)S, (6) Itnha(tWfoureatnayl.w2,00w5(cid:62)),xλ1 iiss Gseatutsosia0n.5aannddwit(cid:62)ixs2aisssusmymed- (cid:113) φ23−1(2γ(1)1−γ) iiff xx1 ∼∼(Gµ(µ1,Σ,Σ1)S)U., masesturmicp,t(i7o)nscamnaybebaeprperloaxxiemdaatsedwbeyha(3v)e.sAhogwainn,inthtehsee 1 1 1 last section. Problem (3) is similar to LDA’s optimiza- and {w(cid:63),b(cid:63)} is the optimal solution of (3) and (4). tion problem PleaserefertoAppendixAfortheproofofTheorem1. w(cid:62)(µ −µ ) max 1 2 . (8) (cid:112) We have derived the biased MPM algorithm from a w(cid:54)=0 w(cid:62)(Σ +Σ )w 1 2 different perspective. We reveal that only the assump- Problem (3) can be solved by eigen-decomposition and tion of symmetric distributions is needed to arrive at a a closed-form solution can be derived: simple unconstrained formulation. Compared with the approach in (Huang et al. 2004), we have used more w(cid:63) =Σ−1(µ −µ ), b(cid:63) =w(cid:63)(cid:62)µ . (9) 1 1 2 2 information to simply the optimization problem. 
More importantly, as will be shown in the next section, this Ontheotherhand,eachnodeincascadedboostingclas- unconstrained formulation enables us to design a new sifiers has the following form: boosting algorithm. f(x)=sign(w(cid:62)H(x)−b). (10) There is a close connection between our algorithm andthelinearasymmetricclassifier(LAC)in(Wuetal. We override the symbol H(x) here, which denotes the 2005).Theresultingproblem(3)isexactlythesameas output vector of all weak classifiers over the datum x. LAC in (Wu et al. 2005). Removing the inequality in We can cast each node as a linear classifier over the this constraint leads to a problem solvable by eigen- feature space constructed by the binary outputs of all decomposition. We have thus shown that the results of weakclassifiers.Foreachnodeinacascadeclassifier,we Wu et al. may be generalized from the Gaussian dis- wish to maximize the detection rate while maintaining tributions assumed in (Wu et al. 2005) to symmetric thefalsepositiverateatamoderatelevel(forexample, distributions. around 50.0%). That is to say, the problem (3) repre- sents the node learning goal. Boosting algorithms such as AdaBoost can be used as feature selection methods, 3 Linear Asymmetric Classification andLACisusedtolearnalinearclassifieroverthosebi- naryfeatureschosenbyboostingasinWuetal.(2005). We have shown that starting from the biased minimax The advantage of this approach is that LAC considers probability machine, we are able to obtain the same the asymmetric node learning explicitly. optimization formulation as shown in Wu et al. 
(2005), However, there is a precondition on the validity of whilemuchweakeningtheunderlyingassumption(sym- LAC that for any w, w(cid:62)x is a Gaussian and w(cid:62)x is metricdistributionsversusGaussiandistributions).Be- 1 2 symmetric.Inthecaseofboostingclassifiers,w(cid:62)x and fore we propose our LACBoost and FisherBoost, how- 1 w(cid:62)x can be expressed as the margin of positive data ever, we provide a brief overview of LAC. 2 and negative data, respectively. Empirically Wu et al. Wu et al. (2008) proposed linear asymmetric clas- (2008)verifiedthatw(cid:62)xisapproximatelyGaussianfor sification (LAC) as a post-processing step for training a cascade face detector. We discuss this issue in more nodes inthe cascade framework.In (Wu et al.2008), it detail in Section 5. Shen and Li (2010b) theoretically is stated that LAC is guaranteed to reach an optimal proved that under the assumption that weak classifiers solution under the assumption of Gaussian data distri- are independent, the margin of AdaBoost follows the butions.WenowknowthatthisGaussianalitycondition Gaussian distribution, as long as the number of weak may be relaxed. classifiersissufficientlylarge.InSection5weverifythis Suppose that we have a linear classifier theoretical result by performing the normality test on f(x)=sign(w(cid:62)x−b). nodes with different number of weak classifiers. Training EffectiveNodeClassifiersforCascadeClassification 7 4 Constructing Boosting Algorithms from LDA to the m negative data. We now see that µ −µ = 2 1 2 and LAC e(cid:62)ρ, C = m /m · Σ + m /m · Σ with Σ the w 1 1 2 2 1,2 covariance matrices. Noting that In kernel methods, the original data are nonlinearly mappedtoafeaturespacebyamappingfunctionΨ(·). 1 (cid:88) w(cid:62)Σ w= (ρ −ρ )2, The function need not be known, however, as rather 1,2 m (m −1) i k 1,2 1,2 than being applied to the data directly, it acts instead i>k,yi=yk=±1 through the inner product Ψ(x )(cid:62)Ψ(x ). 
In boosting i j we can easily rewrite the original problem (11) (and (R¨atsch et al. 2002), however, the mapping function (12)) into: can be seen as being explicitly known, as Ψ(x) : x (cid:55)→ [h (x),...,h (x)].LetusconsidertheFisherLDAcase 1 n firstbecausethesolutiontoLDAwillgeneralizetoLAC min 1ρ(cid:62)Qρ−θe(cid:62)ρ, w,ρ 2 straightforwardly, by looking at the similarity between s.t. w(cid:60)0,1(cid:62)w=1, (3) and (8). ρ =(Aw) ,i=1,··· ,m. (16) Fisher LDA maximizes the between-class variance i i and minimizes the within-class variance. In the binary- (cid:20) (cid:21) class case, the more general formulation in (8) can be Q 0 Here Q= 1 is a block matrix with expressed as 0 Q 2 (µ −µ )2 w(cid:62)C w max 1 2 = b , (11) 1 − 1 ...− 1 w σ1+σ2 w(cid:62)Cww m m(m1−1) m(m1−1) − 1 1 ...− 1 wclahsesrescCabttearndmCatwricaerse; µtheabnedtwµeen-acrleassthaendprowjietchtiend- Q1 = m(m...1−1) m... ... m(m...1−1), 1 2 centers of the two classes. The above problem can be − 1 − 1 ... 1 m(m1−1) m(m1−1) m equivalently reformulated as andQ issimilarlydefinedbyreplacingm withm in min w(cid:62)C w−θ(µ −µ ), (12) 2 1 2 w w 1 2 Q1: for some certain constant θ and under the assumption that µ −µ ≥0.1 Now in the feature space, our data 1 − 1 ...− 1 1 2 m m(m2−1) m(m2−1) areΨ(x ),i=1...m.Definethevectorse,e ,e ∈Rm − 1 1 ...− 1 such thait e = e1 +e2, the i-th entry of e11is 12/m1 if Q2 = m(m...2−1) m... ... m(m...2−1). y = +1 and 0 otherwise, and the i-th entry of e is i 2 − 1 − 1 ... 1 1/m2 if yi =−1 and 0 otherwise. We then see that m(m2−1) m(m2−1) m µ = 1 w(cid:62) (cid:88) Ψ(x )= 1 (cid:88) A w Also note that we have introduced a constant 1 before 1 m i m i: 2 1 1 the quadratic term for convenience. The normalization yi=1 yi=1 1 (cid:88) constraint 1(cid:62)w = 1 removes the scale ambiguity of w. = (Aw) =e(cid:62)Aw, (13) m i 1 Without it the problem is ill-posed. 
and

$$\mu_2 = \frac{1}{m_2} w^\top \sum_{y_i=-1} \Psi(x_i) = \frac{1}{m_2} \sum_{y_i=-1} H_{i:} w = -e_2^\top A w. \tag{14}$$

For ease of exposition we order the training data according to their labels, so the vector $e \in \mathbb{R}^m$ is

$$e = [1/m_1, \cdots, 1/m_2, \cdots]^\top, \tag{15}$$

and the first $m_1$ components of $\rho$ correspond to the positive training data and the remaining ones correspond

We see from the form of (3) that the covariance of the negative data is not involved in LAC, and thus that if we set $Q = \begin{bmatrix} Q_1 & 0 \\ 0 & 0 \end{bmatrix}$ then (16) becomes the optimization problem of LAC.

At this stage, it remains unclear how to solve problem (16) because we do not know all the weak classifiers. There may be extremely (or even infinitely) many weak classifiers in $\mathcal{H}$, the set from which $h(\cdot)$ is selected, meaning that the dimension of the optimization variable $w$ may also be extremely large. So (16) is a semi-infinite quadratic program (SIQP). We show how column generation can be used to solve this problem. To make column generation applicable, we need to derive a specific Lagrange dual of the primal problem.

¹ In our object detection experiment, we found that this assumption can always be satisfied.

4.1 The Lagrange Dual Problem

We now derive the Lagrange dual of the quadratic problem (16). Although we are only interested in the variable $w$, we need to keep the auxiliary variable $\rho$ in order to obtain a meaningful dual problem. The Lagrangian of (16) is

$$L(\underbrace{w, \rho}_{\text{primal}}, \underbrace{u, r}_{\text{dual}}) = \frac{1}{2}\rho^\top Q \rho - \theta e^\top \rho + u^\top(\rho - Aw) - q^\top w + r(\mathbf{1}^\top w - 1), \tag{17}$$

with $q \succeq 0$. Taking $\sup_{u,r} \inf_{w,\rho} L(w, \rho, u, r)$ gives the following Lagrange dual:

$$\max_{u,r}\; -r - \overbrace{\tfrac{1}{2}(u - \theta e)^\top Q^{-1}(u - \theta e)}^{\text{regularization}}, \quad \text{s.t.}\; \sum_{i=1}^{m} u_i A_{i:} \preceq r\mathbf{1}^\top. \tag{18}$$

In our case, $Q$ is rank-deficient and its inverse does not exist (for both LDA and LAC). In fact, both $Q_1$ and $Q_2$ have a zero eigenvalue whose corresponding eigenvector is the all-ones vector. This is easy to see because, for $Q_1$ and $Q_2$, the sum of each row (or each column) is zero. We can simply regularize $Q$ with $Q + \tilde{\delta} I$, where $\tilde{\delta}$ is a small positive constant. $Q$ is a diagonally dominant matrix, but not strictly so. Hence $Q + \tilde{\delta} I$ with any $\tilde{\delta} > 0$ is strictly diagonally dominant and, by the Gershgorin circle theorem, a strictly diagonally dominant matrix must be invertible.

One of the KKT optimality conditions between the dual and primal is

$$\rho^\star = -Q^{-1}(u^\star - \theta e), \tag{19}$$

which can be used to establish the connection between the dual optimum and the primal optimum. This is obtained from the fact that the gradient of $L$ w.r.t. $\rho$ must vanish at the optimum, $\partial L / \partial \rho_i = 0$, $\forall i = 1 \cdots m$.

Problem (18) can be viewed as a regularized LPBoost problem. Compared with the hard-margin LPBoost (Demiriz et al. 2002), the only difference is the regularization term in the cost function. The duality gap between the primal (16) and the dual (18) is zero. In other words, the solutions of (16) and (18) coincide. Instead of solving (16) directly, one calculates the most violated constraint in (18) iteratively for the current solution and adds this constraint to the optimization problem. In theory, any column that violates dual feasibility can be added. To speed up the convergence, we add the most violated constraint by solving the following problem:

$$h'(\cdot) = \operatorname*{argmax}_{h(\cdot)} \sum_{i=1}^{m} u_i y_i h(x_i). \tag{20}$$

This is exactly the same as the problem that standard AdaBoost and LPBoost solve to produce the best weak classifier at each iteration, that is, to find the weak classifier that has the minimum weighted training error. We summarize the LACBoost/FisherBoost algorithm in Algorithm 1. By simply changing $Q_2$, Algorithm 1 can be used to train either LACBoost or FisherBoost. Note that to obtain an actual strong classifier, one may need to include an offset $b$, i.e. the final classifier is $\sum_{j=1}^{n} w_j h_j(x) - b$, because from the cost function of our algorithm (12) we can see that the cost function itself does not minimize any classification error. It only finds a projection direction in which the data can be maximally separated. A simple line search can find an optimal $b$. Moreover, when training a cascade, we need to tune this offset anyway, as shown in (10).

Algorithm 1 Column generation for SIQP.
Input: Labeled training data $(x_i, y_i)$, $i = 1 \cdots m$; termination threshold $\varepsilon > 0$; regularization parameter $\theta$; maximum number of iterations $n_{\max}$.
1 Initialization: $n = 0$; $w = 0$; and $u_i = 1/m$, $i = 1 \cdots m$.
2 for iteration $= 1 : n_{\max}$ do
3   - Check for the optimality: if iteration $> 1$ and $\sum_{i=1}^{m} u_i y_i h'(x_i) < r + \varepsilon$, then break; and the problem is solved;
4   - Add $h'(\cdot)$ to the restricted master problem, which corresponds to a new constraint in the dual;
5   - Solve the dual problem (18) (or the primal problem (16)) and update $r$ and $u_i$ ($i = 1 \cdots m$).
6   - Increment the number of weak classifiers $n = n + 1$.
Output: The selected features are $h_1, h_2, \ldots, h_n$. The final strong classifier is $F(x) = \sum_{j=1}^{n} w_j h_j(x) - b$. Here the offset $b$ can be learned by a simple line search.

Fig. 2: Decision boundaries of AdaBoost (top) and FisherBoost (bottom) on 2D artificial data generated from the Gaussian distribution (positive data represented by □'s and negative data by ×'s; axes $x_1$ and $x_2$). Weak classifiers are vertical and horizontal decision stumps. FisherBoost places more emphasis on positive samples than on negative samples. As a result, the decision boundary of FisherBoost is more similar to the Gaussian distribution than the decision boundary of AdaBoost.

The convergence of Algorithm 1 is guaranteed by general column generation or cutting-plane algorithms, which is easy to establish:

Theorem 2 The column generation procedure decreases the objective value of problem (16) at each iteration and hence in the limit it solves the problem (16) globally to a desired accuracy.

The proof is deferred to Appendix B. In short, when a new $h'(\cdot)$ that violates dual feasibility is added, the new optimal value of the dual problem (maximization) would decrease. Accordingly, the optimal value of its primal problem decreases too, because they have the same optimal value due to the zero duality gap. Moreover, the primal cost function is convex, therefore in the end it converges to the global minimum.

At each iteration of column generation, in theory, we can solve either the dual (18) or the primal problem (16). Here we choose to solve an equivalent variant of the primal problem (16):

$$\min_{w}\; \frac{1}{2} w^\top (A^\top Q A) w - (\theta e^\top A) w, \quad \text{s.t.}\; w \in \Delta_n, \tag{21}$$

where $\Delta_n$ is the unit simplex, which is defined as $\{w \in \mathbb{R}^n : \mathbf{1}^\top w = 1, w \succeq 0\}$.

In practice, it could be much faster to solve (21) since

1. Generally, the primal problem has a smaller size, hence it is faster to solve. The number of variables of (18) is $m$ at each iteration, while the number of variables is the number of iterations for the primal problem. For example, in Viola-Jones' face detection framework, the number of training data $m = 10{,}000$ and $n_{\max} = 200$. In other words, the primal problem has at most 200 variables in this case;
2. The dual problem (18) is a standard QP problem. It has no special structure to exploit. As we will show, the primal problem (21) belongs to a special class of problems and can be efficiently solved using entropic/exponentiated gradient descent (EG) (Beck and Teboulle 2003; Collins et al. 2008). See Appendix C for details of the EG algorithm.

A fast QP solver is extremely important for training our object detector since we need to solve a few thousand QP problems. Compared with standard QP solvers like Mosek (MOSEK 2010), EG is much faster. EG makes it possible to train a detector using almost the same amount of time as using standard AdaBoost because the majority of time is spent on weak classifier training and bootstrapping.
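To make the simplex-constrained problem (21) more concrete, the following is a minimal Python/NumPy sketch of an exponentiated-gradient iteration. It is not the procedure of Appendix C: the function name `eg_simplex_qp`, the fixed step size `eta`, and the fixed iteration count `n_iter` are all assumptions made purely for illustration, and a practical solver would likely use a line search and a stopping test.

```python
import numpy as np

def eg_simplex_qp(A, Q, e, theta, eta=0.1, n_iter=2000):
    """Sketch: minimize 0.5 * w'(A'QA)w - theta * (e'A)w over the unit simplex.

    A is the m x n matrix of margins, Q an m x m positive semidefinite matrix,
    e a length-m vector. eta and n_iter are illustrative choices, not tuned values.
    """
    n = A.shape[1]
    P = A.T @ Q @ A             # n x n quadratic term of (21)
    c = theta * (A.T @ e)       # linear term of (21)
    w = np.full(n, 1.0 / n)     # start at the centre of the simplex
    for _ in range(n_iter):
        grad = P @ w - c        # gradient of the objective at w
        # Multiplicative update; shifting by grad.min() keeps exp() bounded
        # and does not change the result after normalization.
        w = w * np.exp(-eta * (grad - grad.min()))
        w /= w.sum()            # renormalize so that w stays on the simplex
    return w

# Toy usage with hypothetical values (A = Q = I, so the minimizer equals e).
A_toy = np.eye(3)
Q_toy = np.eye(3)
e_toy = np.array([0.5, 0.3, 0.2])
w_toy = eg_simplex_qp(A_toy, Q_toy, e_toy, theta=1.0)
```

The multiplicative form keeps every iterate non-negative and normalized, so no explicit projection onto $\Delta_n$ is needed; this is the structural property that makes EG attractive for problems of the form (21).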
We can recover both of the dual variables $u^\star, r^\star$ easily from the primal variables $w^\star, \rho^\star$:

$$u^\star = -Q\rho^\star + \theta e; \tag{22}$$

$$r^\star = \max_{j=1 \ldots n} \Big\{ \sum_{i=1}^{m} u_i^\star A_{ij} \Big\}. \tag{23}$$

The second equation is obtained from the fact that in the dual problem's constraints, at optimum, there must exist at least one $u_i^\star$ such that the equality holds. That is to say, $r^\star$ is the largest edge over all weak classifiers.

In summary, when using EG to solve the primal problem, Line 5 of Algorithm 1 is:

- Solve the primal problem (21) using EG, and update the dual variables $u$ with (22), and $r$ with (23).

5 Experiments

In this section, we perform our experiments on both synthetic and challenging real-world data sets, e.g., face and pedestrian detection.

5.1 Synthetic Testing

We first illustrate the performance of FisherBoost on an asymmetrical synthetic data set where there are a large number of negative samples compared to the positive ones. Fig. 2 demonstrates the subtle difference in classification boundaries between AdaBoost and FisherBoost. It can be observed that FisherBoost places more emphasis on positive samples than on negative samples to ensure that these positive samples are classified correctly. AdaBoost, on the other hand, treats positive and negative samples equally. This might be due to the fact that AdaBoost only optimizes the overall classification accuracy. This finding is consistent with our results reported earlier in (Paisitkriangkrai et al. 2009; Shen et al. 2011).

5.2 Comparison With Other Asymmetric Boosting

In this experiment, FisherBoost and LACBoost are compared against several asymmetric boosting algorithms, namely, AdaBoost with LAC or Fisher LDA post-processing (Wu et al. 2008), AsymBoost (Viola and Jones 2002), cost-sensitive AdaBoost (CS-ADA) (Masnadi-Shirazi and Vasconcelos 2011) and rate constrained boosting (RCBoost) (Saberian and Vasconcelos 2012). The results of AdaBoost are also presented as the baseline. For each algorithm, we train a strong classifier consisting of 100 weak classifiers along with their coefficients. The threshold is determined such that the false positive rate on the test set is 50%. For every method, the experiment is repeated 5 times and the average detection rate on the positive class is reported. For FisherBoost and LACBoost, the parameter $\theta$ is chosen from $\{1/10, 1/12, 1/15, 1/20\}$ by cross-validation. For AsymBoost, we choose $k$ (asymmetric factor) from $\{2^{0.1}, 2^{0.2}, \cdots, 2^{0.5}\}$ by cross-validation. For CS-ADA, we set the cost for misclassifying positive and negative data as follows. We assign the asymmetric factor $k = C_1/C_2$ and restrict $0.5(C_1 + C_2) = 1$. We choose $k$ from $\{1.2, 1.65, 2.1, 2.55, 3\}$ by cross-validation. For RCBoost, we conduct two experiments. In the first experiment, we use the same training set to enforce the target detection rate, while in the second experiment we use 75% of the training data to train the model and the other 25% to enforce the target detection rate. We set the target detection rate, $D_T$, to 99.5%, the barrier coefficient, $\gamma$, to 2 and the number of iterations before halving $\gamma$, $N_d$, to 10.

We tested the performance of all algorithms on five real-world data sets, including both machine learning (USPS) and vision data sets (cars, faces, pedestrians, scenes). We categorized the USPS data set into two classes: even digits and odd digits. For faces, we use face data sets from (Viola and Jones 2004) and randomly extract 5000 negative patches from background images. We apply principal component analysis (PCA) to preserve 95% of the total variation. The new data set has a dimension of 93. For UIUC car (Agarwal et al. 2004), we downsize the original image from 40×100 pixels to 20×50 pixels and apply PCA. The projected data capture 95% of the total variation and have a final dimension of 228. For the Daimler-Chrysler pedestrian data sets (Munder and Gavrila 2006), we apply PCA to the original 18×36 pixels. The projected data capture 95% of the variation and have a final dimension of 139. For indoor/outdoor scenes, we divide the 15-scene data set used in (Lazebnik et al. 2006) into 2 groups: indoor and outdoor scenes. We use CENTRIST as our feature descriptor and build 50 visual code words using the histogram intersection kernel (Wu and Rehg 2011). Each image is represented in a spatial hierarchy manner. Each image consists of 31 sub-windows. In total, there are 1550 feature dimensions per image. All 5 classifiers are trained to remove 50% of the negative data, while retaining almost all positive data. We compare their detection rates in Table 1. From our experiments, FisherBoost demonstrates the best performance on most data sets. However, LACBoost does not perform as well as expected. We suspect that the poor performance might be partially due to numerical issues, which can cause overfitting. We will discuss this in more detail in Section 5.6.

5.3 Face Detection Using a Cascade Classifier

In this experiment, eight asymmetric boosting methods are evaluated with the multi-exit cascade (Pham et al. 2008): FisherBoost/LACBoost, AdaBoost alone or with LDA/LAC post-processing (Wu et al. 2008), and AsymBoost alone or with LDA/LAC post-processing. We have also implemented Viola-Jones' face detector (AdaBoost with the conventional cascade) as the baseline (Viola and Jones 2004). Furthermore, our face detector is also compared with state-of-the-art methods, including some cascade design methods, i.e., WaldBoost (Sochman and Matas 2005), FloatBoost (Li and Zhang 2004), Boosting Chain (Xiao et al. 2003) and the extension of (Saberian and Vasconcelos 2010), RCECBoost (Saberian and Vasconcelos 2012). The algorithm for training a multi-exit cascade is summarized in Algorithm 2.

We first illustrate the validity of adopting LAC and Fisher LDA post-processing to improve the node learning objective in the cascade classifier. As described above, LAC and LDA assume that the margin of the training data associated with the node classifier in such a cascade exhibits a Gaussian distribution. We demonstrate this assumption on the face detection task in Fig. 3. Fig. 3 shows the normal probability plot of the margins of the positive training data for the first three node classifiers in the multi-exit LAC classifier. The figure reveals that the larger the number of weak classifiers used the more closely the margins follow the Gaussian