To Boost or Not to Boost? On the Limits of Boosted Trees for Object Detection

Eshed Ohn-Bar and Mohan M. Trivedi
Computer Vision and Robotics Research Laboratory
University of California San Diego
{eohnbar, mtrivedi}@ucsd.edu

Abstract—We aim to study the modeling limitations of the commonly employed boosted decision trees classifier. Inspired by the success of large, data-hungry visual recognition models (e.g. deep convolutional neural networks), this paper focuses on the relationship between modeling capacity of the weak learners, dataset size, and dataset properties. A set of novel experiments on the Caltech Pedestrian Detection benchmark results in the best known performance among non-CNN techniques while operating at fast run-time speed. Furthermore, the performance is on par with deep architectures (9.71% log-average miss rate), while using only HOG+LUV channels as features. The conclusions from this study are shown to generalize over different object detection domains as demonstrated on the FDDB face detection benchmark (93.37% accuracy). Despite the impressive performance, this study reveals the limited modeling capacity of the common boosted trees model, motivating a need for architectural changes in order to compete with multi-level and very deep architectures.

I. INTRODUCTION

Increases in data availability and computational power have led to a revolution in object recognition. Modern object recognition algorithms combine deep multi-layer architectures with large datasets in order to achieve state-of-the-art recognition performance [1]–[3]. A main advantage of such learning architectures stems from the ability to increase modeling capacity (e.g. network parameters and depth) in a straightforward manner. At the same time, boosted detectors [4]–[6] remain highly successful for fast detection of objects [7]–[21]. The Viola and Jones [4] learning architecture has remained largely unchanged, with boosting [22] used for training a cascade of weak learners (e.g. decision trees). State-of-the-art pedestrian detectors often employ such models [15], while pursuing improved feature representations in order to achieve detection performance gains [9], [16]. Despite being a different learning architecture compared to the modern Convolutional Network (CNN) [1], the boosted decision tree model also allows for a straightforward increase in modeling capacity for handling large datasets (Fig. 1).

Fig. 1: By varying dataset properties with different augmentation techniques (a scaling example is shown) and model capacity (tree depth), we study limitations of the boosted decision tree classifier. Consequently, the insights revealed are employed in order to train state-of-the-art pedestrian and face detectors.

Fig. 2: Comparison of our key results (in thick lines, ACF++ and LDCF84++) with published methods on the Caltech test set. Our versions of the popular ACF and LDCF models demonstrate large gains in detection performance over the baselines with no significant modifications to the model or feature extraction pipeline.
Motivated by this observation, we perform an extensive set of experiments designed to better understand the limitations of the commonly employed boosted decision trees model. The large Caltech pedestrian benchmark [6] is suitable for such a task, with some preliminary work [16], [23] showing significant performance gains by increasing dataset size and the modeling capacity of the weak learners. In particular, the contributions of this work are as follows.

Limitations of a classifier: The relationship between increasing the modeling capacity of the decision trees jointly with dataset size has been established in previous literature [16], [22], [24], yet its limits have not been fully explored. Surveying current literature, we found inconsistent or limited usage of this relationship in boosting for object detection. In particular, it is not known whether the modeling capacity has saturated or whether it can be further increased with additional data for the detection task. Therefore, we sought to perform a rigorous study with the relationship between dataset size and model capacity as the main research question. In the process of studying this problem, insights regarding the limitations of the model arise. The conclusion of this study is mostly negative, as increasing the modeling capacity of the weak learners in the current boosted decision tree detector provides limited generalization power. Saturation of performance occurs quickly, with no further gains shown with additional data augmentation. Therefore, architectural changes are needed in order to resolve the issue of efficiently increasing modeling capacity in the boosted tree classifier.

Performance gains: The experiments in this paper result in consistent gains in object detection performance, insights into optimal dataset properties, and best training practices on Caltech. For instance, we demonstrate that the commonly employed video-based augmentation [16], [23] on the Caltech training set is limited when compared to other augmentation options. With no significant modifications to the learning or feature extraction pipelines, we improve over the Aggregate Channel Feature (ACF) detector [15] baseline by ∼10 percentage points while still employing the 10 HOG+LUV feature channels. A simple filtering of the channels with 4 filter banks (LDCF [16]) results in the best performing non-CNN detector on Caltech to date. The results of the study are shown to generalize to face detection as well. Note that we intentionally avoid using deep CNN features here, as these often involve training on large external data. On the other hand, simple channel features [15] allow for isolation of the strengths, limitations, and the role of the dataset on the boosting classifier. Furthermore, the careful examination of data augmentation and model capacity allows for a more appropriate comparison between CNN (where extensive data augmentation is commonly employed) and non-CNN detectors on datasets such as Caltech. We hope the more meaningful comparison will motivate future developments in object detection. We also note that several recent CNN-based models on Caltech [25]–[27] make use of the boosted trees classifier.

II. EXPERIMENTAL SETTINGS

A. Boosting Framework

This section presents the tools that will be used for the remainder of the study. Boosting [22] involves greedily minimizing a loss function for the following classification rule:

H(x) = Σ_t α_t h_t(x)   (1)

where the h_t(x) are weak learners, the α_t are scalars, and x ∈ R^K is a feature vector. In this work, we employ the commonly used soft cascade with decision trees as the weak learners [28], [29]. These are composed of decision stumps such that each non-leaf node j produces a binary decision using a feature index k_j, a threshold τ_j ∈ R, and a polarity p_j ∈ {±1}, i.e. h_j(x) ≡ p_j sign(x[k_j] − τ_j). An important parameter is the maximum tree depth, which was generally taken to be 2 for object detection, yet recent studies [16], [23], [30] propose depths of 3-4 as suitable to accommodate an increase in dataset size. Hence, our first issue is efficient training of boosted models on large datasets. The need for quick training is key to algorithmic development and competitive performance, especially when considering that current state-of-the-art CNN object detectors are trained on millions of samples.
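To make the classifier structure concrete, the following sketch shows how a soft cascade of depth-limited stump trees could be evaluated. It is an illustrative reimplementation rather than the toolbox code used in our experiments; the StumpTree class, the heap-style node layout, and the per-stage rejection thresholds are assumptions made only for the example.

```python
import numpy as np

class StumpTree:
    """A depth-limited decision tree whose internal nodes are stumps
    h_j(x) = p_j * sign(x[k_j] - tau_j). Illustrative only: nodes are stored
    in heap order (children of node j are 2j+1 and 2j+2), and the leaves,
    which follow the internal nodes, store real-valued scores."""

    def __init__(self, k, tau, p, leaf_scores):
        self.k, self.tau, self.p = k, tau, p      # one entry per internal node
        self.leaf_scores = leaf_scores            # one entry per leaf
        self.n_internal = len(k)

    def score(self, x):
        j = 0
        while j < self.n_internal:
            go_right = self.p[j] * (x[self.k[j]] - self.tau[j]) > 0
            j = 2 * j + (2 if go_right else 1)
        return self.leaf_scores[j - self.n_internal]


def soft_cascade_score(x, trees, alphas, stage_thresholds, reject_score=-np.inf):
    """Evaluate H(x) = sum_t alpha_t * h_t(x) with soft-cascade early
    rejection [28]: stop as soon as the partial sum falls below the
    per-stage threshold (thresholds are assumed to be given)."""
    H = 0.0
    for tree, alpha, theta in zip(trees, alphas, stage_thresholds):
        H += alpha * tree.score(x)
        if H < theta:
            return reject_score
    return H


# Example: a single depth-2 tree over a 10-dimensional feature vector.
tree = StumpTree(k=[3, 0, 7], tau=[0.5, 0.1, 0.9], p=[1, -1, 1],
                 leaf_scores=[-1.0, -0.2, 0.3, 1.0])
x = np.random.rand(10)
print(soft_cascade_score(x, [tree], [1.0], [-1.0]))
```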
Training with randomness: In this work, we adopt the quick boosting approach from [31]. Because stump training is costly, O(K × N) for K features and N samples, a common practice is to quantize feature values into bins (256 in this work) and to sample a random subset of the features when searching for the optimal feature index. Most approaches employ the ACF detector [15], which samples 1/16 of the total feature set in each iteration. We observed significant variation in performance over multiple runs with this random sampling strategy, by up to 3-4%. This is an issue, yet an appropriate discussion was not found in related literature. Therefore, we modify the last of the four stages of hard negative generation with increasing numbers of weak classifiers, [64, 256, 1024, 4096]. In the last stage, the feature set is exhaustively searched over for the initial 512 weak learners, after which random sampling of 1/16 follows. This procedure was found to be necessary for careful analysis, and it balances reproducibility with speed (unlike the very slow exhaustive search in all iterations [23]), reducing variability to within ∼1% in performance. In experiments where significance is uncertain, multiple random seeds are used for training initialization, with the best one shown. Batch training is another possible direction which we leave for future work.
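As an illustration of the per-iteration cost being discussed, the sketch below searches for the best weighted stump over features quantized into 256 bins while examining only a random 1/16 of the feature columns. It is a simplified stand-in for the quick boosting implementation of [31]; the function name and array layout are our own.

```python
import numpy as np

def best_stump(X_binned, labels, weights, n_bins=256, sample_frac=1 / 16, rng=None):
    """Weighted stump search over quantized features (illustrative sketch).

    X_binned: (N, K) array of feature values quantized into n_bins integer bins.
    labels:   (N,) array in {+1, -1}.  weights: (N,) non-negative boosting weights.
    Returns (feature index k, bin threshold tau, polarity p, weighted error),
    searching only a random subset of the K columns (1/16 by default, as in
    ACF-style training)."""
    rng = np.random.default_rng() if rng is None else rng
    N, K = X_binned.shape
    candidates = rng.choice(K, size=max(1, int(K * sample_frac)), replace=False)

    w_pos = np.where(labels > 0, weights, 0.0)
    w_neg = np.where(labels < 0, weights, 0.0)
    W_pos, W_neg = w_pos.sum(), w_neg.sum()
    best = (None, 0, 1, np.inf)

    for k in candidates:
        col = X_binned[:, k]
        # Cumulative class-weight histograms give the error of every possible
        # threshold in O(N + n_bins) instead of O(N * n_bins).
        cpos = np.cumsum(np.bincount(col, weights=w_pos, minlength=n_bins))
        cneg = np.cumsum(np.bincount(col, weights=w_neg, minlength=n_bins))
        err_plus = cpos + (W_neg - cneg)     # p = +1: predict positive when x[k] > tau
        err_minus = (W_pos - cpos) + cneg    # p = -1: predict positive when x[k] <= tau
        for err, p in ((err_plus, +1), (err_minus, -1)):
            tau = int(np.argmin(err))
            if err[tau] < best[3]:
                best = (int(k), tau, p, float(err[tau]))
    return best
```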
B. Augmentation Techniques

As previously mentioned in Section I, large gains in boosting-based pedestrian detection came from an increase in dataset size and increased depth of the decision trees (introduced in Nam et al. [16]), although the design choices for these have not been well justified or studied. Consequently, existing state-of-the-art detectors employ dense frame sampling (e.g. sampling every 3rd frame, resulting in a ten-fold increase in dataset size, Caltech×10), yet this type of video-based augmentation will be shown to present several issues leading to sub-optimal detector performance.

Identifying the most successful augmentation techniques is essential, as the increased dataset size regularizes training of models with increased capacity. We note that others have studied data augmentation, specifically for pedestrian detection, but mostly with a goal of lowering annotation effort (i.e. rendering pedestrian images into real backgrounds [32], [33]) and not improved performance. In addition to sampling more frames from the Caltech videos, the following augmentations are studied (two of them are sketched in code after this list):

Color: Color alteration has been proposed in [1] using Principal Component Analysis (PCA) on the RGB pixel values in the training set. Specifically, a random Gaussian variable α ∈ R^3 is drawn and the quantity [p_1, p_2, p_3][α_1 λ_1, α_2 λ_2, α_3 λ_3]^T, where the p_i and λ_i are eigenvector-eigenvalue pairs, is added to the RGB image pixels.

Flipping: Most current trained detectors employ horizontal mirroring of images in order to double the number of positive samples.

Translation (T): Two main parameters, namely the maximum amount of pixel shift allowed (m) and the sampling interval (n), determine how many additional positive samples are generated. We note these in the experiments as Tnm. For instance, T31 generates 4 additional samples at 1 pixel shift in each direction (west/east/north/south).

Scale+Crop (S): As experiments began, we discovered high sensitivity of the classifier to even small noise injected in the ground truth location. As an alternative to translation, we propose to scale each positive sample by a factor (either in the horizontal, vertical, or both directions), re-center, and crop it (shown in Fig. 1). Hence, the procedure is slightly different from the common cropping employed in training CNN detectors (corner/center cropping). This augmentation is similar to slight ground truth width/height jitter without changes to the center of the box.

Occlusion: Occlusion handling is an important challenge for pedestrian detection [34], [35]. Incorporation of samples with higher levels of occlusion is another technique for increasing dataset size and studying generalization capability.

FG/BG Transfer: The visibility masks of pedestrians in Caltech can be employed in order to transfer foreground/background between images in the dataset.
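The sketch below (referenced in the list above) illustrates two of these options: the Scale+Crop (S) jitter and the PCA-based color alteration of [1]. The helper names and boundary handling are illustrative assumptions; the sampling schedule used in the actual training pipeline may differ.

```python
import numpy as np

def scale_and_crop(img, box, scale_x=1.1, scale_y=1.1):
    """Scale+Crop (S) augmentation: enlarge the ground-truth box about its
    center by the given factors, then crop, keeping the center fixed
    (illustrative; real-pipeline boundary handling may differ).
    box = (x, y, w, h) in pixels; img is an HxWxC array."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale_x, h * scale_y
    x0 = max(int(round(cx - new_w / 2.0)), 0)
    y0 = max(int(round(cy - new_h / 2.0)), 0)
    x1 = min(int(round(cx + new_w / 2.0)), img.shape[1])
    y1 = min(int(round(cy + new_h / 2.0)), img.shape[0])
    return img[y0:y1, x0:x1]

def pca_color_jitter(img, eigvals, eigvecs, sigma=0.1, rng=None):
    """AlexNet-style color alteration [1]: add sum_i alpha_i * lambda_i * p_i
    to the RGB channels, with alpha ~ N(0, sigma^2) drawn once per image.
    eigvals (3,) and eigvecs (3, 3, eigenvectors as columns) are assumed to
    come from PCA of RGB pixel values over the training set."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alpha * eigvals)      # 3-vector added to every pixel
    return np.clip(img.astype(np.float32) + shift, 0, 255)
```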
C. Model Settings

Our training settings specify a model size of 41 × 100 pixels. With padding, the sliding window becomes 64 × 128 pixels. In order to deal with smaller pedestrians ('reasonable' test settings involve pedestrians of height 50 pixels and up), we upsample the images by one octave. Most of the augmentation experiments will be performed on Caltech×3 (sampling every 10th frame from video), but higher samplings of Caltech×7.5 (every 4th frame), Caltech×15, and Caltech×30 (every frame) will also be studied.

III. EXPERIMENTAL ANALYSIS

This section demonstrates the importance of data augmentation, the impact of augmentation type, and the significance of increased weak classifier model capacity in training a generalizable pedestrian detector. All results are measured in log-average miss rate (a lower value corresponds to better detection performance) as described in [6].
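For reference, the metric can be computed roughly as follows: miss rate is sampled at reference false-positives-per-image (FPPI) values spaced evenly in log space (nine points over [10^-2, 10^0] in the standard setting) and the samples are averaged in log space. The sketch below is an illustration of this idea, not the official evaluation code of [6]; the handling of curve regions that never reach a reference FPPI is a simplifying assumption.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1e0, n_ref=9):
    """Approximate Caltech-style log-average miss rate [6]: sample the miss
    rate at n_ref FPPI values spaced evenly in log space over [lo, hi] and
    return the geometric mean of the samples (illustrative sketch).

    fppi, miss_rate: matched 1-D arrays tracing a detector's curve."""
    fppi, miss_rate = np.asarray(fppi, float), np.asarray(miss_rate, float)
    order = np.argsort(fppi)
    fppi, miss_rate = fppi[order], miss_rate[order]

    samples = []
    for ref in np.logspace(np.log10(lo), np.log10(hi), n_ref):
        reached = np.flatnonzero(fppi <= ref)
        # If the curve never reaches this FPPI, fall back to the first point
        # (a simplification of the official handling of missing values).
        samples.append(miss_rate[reached[-1]] if reached.size else miss_rate[0])
    return float(np.exp(np.mean(np.log(np.clip(samples, 1e-10, None)))))
```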
Baseline and notation: Our experiments begin with Fig. 3(a), where we first verify the impact of randomness with multiple random seeds in training (see Section II-A) on performance variation using Caltech×3. The L3N50 model is the commonly trained maximum depth 3 model using the available implementation of [6] with 50k total negative samples (in each hard negative mining round we collect N/2 samples and replace these hard samples so that no more than N samples are trained over in a given round). We note that the only form of augmentation from Section II-B commonly employed in training boosting-based pedestrian detectors, besides higher frame sampling from video, is flip augmentation. Hence, flip augmentation is included in the L3N50 baseline. It achieves comparable results to the best known ACF results on Caltech of 29.76% log-average miss rate (Fig. 2).

Flip augmentation: Surprisingly, Fig. 3(a) depicts how the removal of the flip augmentation in the L3N50 baseline results in a significant reduction in log-average miss rate from 29.28% to 26.67%! The reason for the performance reduction becomes apparent immediately, when aspect ratio (ar) standardization (used in training/testing all current object detectors on Caltech [6]) is removed during training in L3N50/ar. Specifically, the standardization is useful in training since it is enforced in evaluation, yet it leads to non-aligned bounding boxes. Flipping these boxes introduces additional poorly-aligned samples which the boosting model does not handle well, even with increased modeling capacity in later stages of the experiments (depth 5 or more models). The solution is simple: either train L3N50/ar models (without aspect ratio standardization and with flip augmentation) or remove flipped sampling (L3N50/flip). Since this is a one-time initial modification, we slightly rename our 'new baselines', so that L3N50 refers to L3N50/flip (without flip, with aspect ratio standardization of training samples) and L3N50* refers to L3N50/ar (with flip, no aspect standardization). The two are shown to perform similarly in Figs. 3(a) and 3(b) (yet L3N50/flip is faster to train as it contains half the samples). We note that this observation was not made in any of the related research studies, and consequently not employed by top performing boosted detectors.

Additional augmentation: Depicted in Fig. 3(b), the L3N50 and the L3N50* models do not benefit from color augmentation, but do benefit from the scaling+crop scheme (best with a scaling factor of S1.1 in the horizontal, vertical, and both directions) and translation augmentation (by 1 pixel shifts, T31). These experiments were run multiple times with different random seeds to ensure statistical significance. The improvements will be shown to be more dramatic with increased modeling capacity (deeper trees). We make two final notes: increasing the number of negatives to L3N100 does not further improve performance, and overall performance has reached a plateau. Therefore, the modeling capacity is increased in L5N50.

Increased model capacity: Still on Caltech×3, the analysis of Fig. 3(c) demonstrates a consistent trend with the previous experiments. Note how flip augmentation in L5N50+flip is still not beneficial, but it is much better handled. The best results are reported in Fig. 3(c). We observe a significant jump in performance, reaching 20.85% with the fast ACF model. These are the best reported results to date (nearly optimal from our experiments).

Fig. 3: Experimental analysis of different training settings on Caltech pedestrians using Caltech×3.

Scaling is shown to be more useful than translation, as even slight off-center shifts hinder performance and are not handled well by the classifier. This is encouraging, yet we begin to understand how limited the boosting model is in sensitivity to dataset properties. Samples with even minimal occlusion and FG/BG transfer were also not shown to be beneficial for increasing the dataset size. Furthermore, we experimented with the addition of external pedestrian datasets (e.g. ETH [36]) and increasing tree depth, but no benefit was shown.

Video augmentation: A main conclusion is that very little video augmentation (Caltech×3 vs. Caltech×10 in current state-of-the-art detectors [16], [19], [23]) was needed to reach best performance. Denser sampling of every 4th frame is shown in Fig. 4(a), demonstrating an interesting trend. First, we must increase the number of negatives to 200k (N200) to reach comparable performance. This fact exposes another known limitation of currently used boosting models which is not reported in related research, one of dataset imbalance. This is necessary for implicitly regularizing deeper models (shown up to depth 8 in Fig. 4(a)). Second, we observe that the S1.1 scheme provides consistent improvements across different video sampling strategies and model choices (Fig. 4(b)). This demonstrates the sub-optimal yet commonly employed procedure of video augmentation. Third, additional augmentation or increased model capacity does not improve over the 20.69% of L5N200. To ensure no further gains can be made, we further increase tree depth to 6 to handle additional data and further increase the dataset in Fig. 4(b), with no visible improvement. Increasing tree depth beyond 6 leads to overfitting behavior, and additional data must be sampled (although our experiments show no final gains).

Fig. 4: Experimental analysis of different training settings on Caltech pedestrians using additional video augmentation (Caltech×7.5 unless stated).

Better features: The achieved 20.69% miss rate is the best known ACF result to date, nearly matching the more computationally intensive feature-rich Checkerboards (CB) detector [37] (18.47% miss rate). CB employs 61 filters for each of the 10 core HOG+LUV channels, resulting in a significant increase in computational requirements. To further study the proposed modifications, we extract features using 4 8×8 LDCF filters [16] computed using PCA eigenvectors of feature patches (referred to as the LDCF84 model). The method still runs 5 times faster than the CB comparison, while reaching a 17.15% miss rate. This demonstrates the generalization of the proposed best practices to other, feature-rich approaches as well. As a final experiment, we report results without the approximation of features in the multi-scale feature pyramid [15]. This allows for a clearer performance comparison against other methods [26], [37] which do not employ approximation. The gains are even more impressive, reaching 15.40% with a light-weight LDCF84 model, significantly outperforming CB.
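To make the LDCF84 construction concrete, a minimal sketch of learning decorrelating filters from feature patches is given below. It assumes patches have already been sampled and flattened from a single channel, and it is not the implementation of [16]; the choice of 4 filters of size 8×8 simply matches the LDCF84 configuration described above.

```python
import numpy as np
from scipy.signal import convolve2d

def learn_ldcf_filters(patches, n_filters=4, patch_size=8):
    """Learn LDCF-style filters as the leading PCA eigenvectors of local
    feature patches (illustrative sketch of the idea behind [16]).

    patches: (n_patches, patch_size * patch_size) flattened patches sampled
    from one feature channel (e.g. a HOG or LUV channel)."""
    X = patches - patches.mean(axis=0, keepdims=True)
    cov = X.T @ X / max(len(X) - 1, 1)
    _, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_filters]       # keep the leading eigenvectors
    return top.T.reshape(n_filters, patch_size, patch_size)

def apply_ldcf(channel, filters):
    """Convolve one feature channel with every learned filter, producing
    n_filters (locally decorrelated) channels for the boosted classifier."""
    return np.stack([convolve2d(channel, f, mode='same') for f in filters])
```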
Summary: Despite impressive gains in performance stemming from simple, intuitive, and theory-driven considerations in training, the boosted detector can only handle mild deformations as augmentation. For instance, color jitter is not helpful, flipping is problematic (and removed in most of the experiments without reduction in performance), and slight translation shifts produce a more difficult learning task that is not modeled well even with deeper trees. In order to address these limitations, we proposed to use an augmentation technique which maintains centered ground truth boxes. Dataset size is key in training deeper trees, yet performance saturates at depth 5/6 even with extensive augmentation. Hence, unlike additional layers in CNN architectures, further increase of the tree depth provides a limited benefit in terms of classification power. The experiments motivate future work on the effective increase of the modeling capacity of tree models.

Run-time analysis: Although not the main focus of the study, we report run-time for the interested reader. The ACF-L5 and LDCF84 models run on a CPU at 6.7 and 2.5 frames per second (fps), respectively. These models are significantly faster than all of the top performing feature-rich models, which range in 0.1-1 fps (e.g. CB [23] runs at 0.5 fps).

TABLE I: Results on the new annotations on Caltech [26] using log-average miss rate over [10^-2, 10^0] and [10^-4, 10^0] (MR^N_-2 and MR^N_-4, respectively). Our ACF+ (with data augmentation) and ACF++ (with contextual reasoning) are on par with state-of-the-art detectors while enjoying low computational complexity. Our LDCF84++ employs just 4 filters on top of the ACF features and outperforms the VGG baseline by a large margin on the more challenging metric MR^N_-4.

Method                      MR^N_-2 (MR^N_-4)
Rotated Filters [26]        12.87 (24.10)
Rotated Filters+VGG [26]     9.32 (21.72)
ACF+                        15.17 (27.28)
ACF++                       13.27 (25.26)
LDCF84+                     11.76 (21.08)
LDCF84++                     9.71 (18.29)

TABLE II: Improvement due to incorporation of the insights in this paper when training face detection models. Results on WIDER-validation.

Method                              AP (%)
ACF-L3N50*                          27.97
ACF-L9N200*                         34.64
ACF-L9N200*+S1.1 (ACF+)             46.80
LDCF84-L9N200*+S1.1 (LDCF+)         47.66

A. Context Analysis

Context in the form of scene representation is essential for robust pedestrian detection. The deeper tree models are able to better capture relationships between features in the image. In particular, we have observed how deeper models are more likely to select features in the padding area of the model around the pedestrian. As a final limitation experiment on Caltech, we propose to use a location prior model in order to study the limitations of the model in capturing spatial context.

First, we studied increasing model padding further with tree depth as a means of better representing contextual information, yet this did not impact performance. On the other hand, due to the application domain, it is reasonable to study context in terms of perspective and position of objects. A second approach was shown to improve performance further, hinting that there is still room for future improvements in context modeling. Given a detection, p = [x y w h s]^T, we construct φ_p = pp^T for each training sample. Consequently, entries in φ_p are employed as features for capturing relationships between position, size, and confidence features, and their products. The object score, s, is recomputed using an SVM [38] on φ_p. We note that this generalizes (and outperforms) the approach in [39], which only employs a subset of the proposed φ_p feature set (setting φ_p = [h^2 y^2 hy h y]).
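A minimal sketch of the contextual feature construction is shown below. The upper-triangular indexing and the linear rescoring helper are illustrative choices of ours; in the paper the rescoring weights come from an SVM [38] trained offline.

```python
import numpy as np

def context_features(detection):
    """Build phi_p = p p^T for a detection p = [x, y, w, h, s] (position,
    size, confidence). The upper triangle of the outer product collects the
    pairwise products (h*y, h^2, y^2, ...) used as contextual features."""
    p = np.asarray(detection, dtype=float)            # [x, y, w, h, s]
    return np.outer(p, p)[np.triu_indices(p.size)]    # 15 values for a 5-vector

def rescore(detections, svm_weights, svm_bias):
    """Recompute each detection score with a linear model on phi_p; the
    weights/bias are assumed to come from an SVM trained offline [38]."""
    return np.array([svm_weights @ context_features(d) + svm_bias
                     for d in detections])
```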
As shown in Fig. 2, context integration consistently improves performance of both the ACF and LDCF84 models by a further 0.5-1% reduction in miss rate. Hence, there is still room for improvement in order to better capture contextual cues within the models. We term our models ACF++ and LDCF84++, where the first '+' stands for the final models trained with data augmentation (Fig. 4), and the second '+' for the contextual reasoning. We observe how the proposed modifications produce consistent improvements over both the ACF and LDCF baselines in Fig. 2 by about 10%. The proposed high-performing models are new baselines in a way, and further gains may be obtained when combined with insights from other studies listed in Fig. 2.

B. Results on Caltech-New

The proposed techniques are tested on the improved-annotation Caltech pedestrians dataset presented in [26], as shown in Table I. The cleaner annotations have a great impact on training the ACF++ and LDCF++ models, which is consistent with our observations in previous experiments. Consequently, the best detection results to date are achieved among both CNN and non-CNN methods on the challenging MR^N_-4 metric, with a miss rate of 18.29%.

C. Generalization to Face Detection

The proposed training procedure is applied to face detection on the recently introduced large WIDER face dataset [40] and the commonly employed FDDB dataset [41]. Specifically, we do not standardize aspect ratio and further augment the dataset as discussed in Section III. As shown in Table II on a validation set, the detection performance improves when up to depth 9 decision trees are employed. This is due to the new challenges in face detection benchmarks (e.g. rotation) not commonly found in pedestrian detection settings. Fig. 6 demonstrates consistent gains in performance, resulting in a state-of-the-art face detector (on FDDB, at 2000 false positives, a 93.37% true positive rate is achieved). On the WIDER test set, Fig. 5 depicts a large improvement over the ACF baseline results reported in [40]. The performance gap is larger on the more challenging 'moderate' and 'hard' test sets.
Fig. 5: Results on the test set of the WIDER face dataset [40] (our curves are shown in red, LDCF+) compared with the current state-of-the-art (updated plots on January 5, 2017): Multiscale Cascade CNN [40], Two-stage CNN [40], ACF [15], [47], Faceness [43], Multitask Cascade CNN [52], and CMS-RCNN [53]. Results are shown for the three difficulty settings described in [40] (easy, moderate, and hard sets). Observe the large improvement in performance over the ACF baseline.

Fig. 6: Results on the FDDB [41] dataset (discrete score evaluation) compared with the state-of-the-art (DP2MFD [42], Faceness [43], HeadHunter [44], Kumar et al. [45], MultiresHPM [46], ACF-multiscale [47], CCF [19], CascadeCNN [48], DDFD [49], NPDFace [50], and Pico [51]). True positive rate at 2000 false positives is shown for each method. Our results significantly improve over the previous state-of-the-art with HOG+LUV features [44], while competing with CNN approaches.

IV. CONCLUDING REMARKS

This study presented a set of novel experiments aimed at better understanding the boosted decision tree model currently employed for many vision tasks. As CNN-based models employ extensive augmentation, we sought to investigate modifying the boosted detector to handle such augmentation as well. Careful analysis regarding generalization power and modeling capacity, dataset size and balance, and overfitting handling produced insights as to the performance limits of such detectors, as well as state-of-the-art pedestrian and face detectors. A main takeaway of this work is the limited representation ability of the boosted model. Unlike CNN models, which consistently benefit from additional modeling capacity (e.g. depth) and data, increasing weak learner complexity saturates early, not allowing for further gains with deeper decision trees even with extensive augmentation. Although careful data augmentation was beneficial, architectural changes for handling large datasets are required at the decision-tree training level for an effective increase of modeling capacity.

V. ACKNOWLEDGMENTS

We acknowledge the support of associated industry partners and thank our UCSD CVRR colleagues, especially Rakesh Rajaram, for helpful discussions. We also thank the reviewers for their constructive comments.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[2] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in CVPR, 2013.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," ICLR, 2014.
[4] P. Viola and M. Jones, "Robust real-time face detection," IJCV, vol. 57, no. 2, pp. 137–154, 2004.
[5] F. Bartoli, G. Lisanti, S. Karaman, A. D. Bagdanov, and A. D. Bimbo, "Unsupervised scene adaptation for faster multi-scale pedestrian detection," in ICPR, 2014.
[6] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," PAMI, 2012.
[7] S. Zhang, C. Bauckhage, and A. B. Cremers, "Efficient pedestrian detection via rectangular features based on a statistical shape model," T-ITS, 2015.
[8] F. De Smedt, K. Van Beeck, T. Tuytelaars, and T. Goedemé, "The combinator: Optimal combination of multiple pedestrian detectors," in ICPR, 2014.
[9] R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Ten years of pedestrian detection, what have we learned?" in ECCV-CVRSUAD, 2014.
[10] R. Benenson, M. Mathias, R. Timofte, and L. V. Gool, "Pedestrian detection at 100 frames per second," in CVPR, 2012.
[11] S. Zhang, D. A. Klein, C. Bauckhage, and A. B. Cremers, "Center-surround contrast features for pedestrian detection," in ICPR, 2014.
[12] M. J. Jones and D. Snow, "Pedestrian detection using boosted features over many frames," in ICPR, 2008.
[13] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral channel features," in BMVC, 2009.
[14] A. D. Costea and S. Nedevschi, "Word channel based multiscale pedestrian detection without image resizing and using only one classifier," in CVPR, 2014.
[15] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," PAMI, 2014.
[16] W. Nam, P. Dollár, and J. H. Han, "Local decorrelation for improved pedestrian detection," in NIPS, 2014.
[17] E. Ohn-Bar and M. M. Trivedi, "Learning to detect vehicles by clustering appearance patterns," T-ITS, 2015.
[18] S. Paisitkriangkrai, C. Shen, and A. van den Hengel, "Strengthening the effectiveness of pedestrian detection with spatially pooled features," in ECCV, 2014.
[19] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Convolutional channel features," in ICCV, 2015.
[20] Z. Cai, M. Saberian, and N. Vasconcelos, "Learning complexity-aware cascades for deep pedestrian detection," in ICCV, 2015.
[21] A. D. Costea, A. V. Vesa, and S. Nedevschi, "Fast pedestrian detection for mobile devices," in ITSC, 2015.
[22] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," The Annals of Statistics, 1998.
[23] S. Zhang, R. Benenson, and B. Schiele, "Filtered channel features for pedestrian detection," in CVPR, 2015.
[24] N. Das, E. Ohn-Bar, and M. M. Trivedi, "On performance evaluation of driver hand detection algorithms: Challenges, dataset, and metrics," in ITSC, 2015.
[25] J. Hosang, M. Omran, R. Benenson, and B. Schiele, "Taking a deeper look at pedestrians," in CVPR, 2015.
[26] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "How far are we from solving pedestrian detection?" in CVPR, 2016.
[27] Y. Tian, P. Luo, X. Wang, and X. Tang, "Deep learning strong parts for pedestrian detection," in ICCV, 2015.
[28] L. Bourdev and J. Brandt, "Robust object detection via soft cascade," in CVPR, 2005.
[29] R. J. Quinlan, "Bagging, boosting, and C4.5," in National Conference on Artificial Intelligence, 1996.
[30] S. Paisitkriangkrai, C. Shen, and A. van den Hengel, "Pedestrian detection with spatially pooled features and structured ensemble learning," arXiv, 2014.
[31] R. Appel, T. Fuchs, P. Dollár, and P. Perona, "Quickly boosting decision trees - pruning underachieving features early," in ICML, 2013.
[32] J. Nilsson, P. Andersson, I. Y. H. Gu, and J. Fredriksson, "Pedestrian detection using augmented training data," in ICPR, 2014.
[33] D. Vázquez, A. M. López, and D. Ponsa, "Unsupervised domain adaptation of virtual and real worlds for pedestrian detection," in ICPR, 2012.
[34] R. N. Rajaram, E. Ohn-Bar, and M. M. Trivedi, "An exploration of why and when pedestrian detection fails," in ITSC, 2015.
[35] E. Ohn-Bar and M. M. Trivedi, "Can appearance patterns improve pedestrian detection?" in IV, 2015.
[36] A. Ess, B. Leibe, K. Schindler, and L. V. Gool, "A mobile vision system for robust multi-person tracking," in CVPR, 2008.
[37] S. Zhang, R. Benenson, and B. Schiele, "Filtered channel features for pedestrian detection," CVPR, 2015.
[38] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Intell. Sys. and Tech., 2011.
[39] D. Park, D. Ramanan, and C. Fowlkes, "Multiresolution models for object detection," in ECCV, 2010.
[40] S. Yang, P. Luo, C. C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," in CVPR, 2016.
[41] V. Jain and E. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," University of Massachusetts, Amherst, Tech. Rep. UM-CS-2010-009, 2010.
[42] R. Ranjan, V. M. Patel, and R. Chellappa, "A deep pyramid deformable part model for face detection," CoRR, vol. abs/1508.04389, 2015.
[43] S. Yang, P. Luo, C. C. Loy, and X. Tang, "From facial parts responses to face detection: A deep learning approach," in ICCV, 2015.
[44] M. Mathias, R. Benenson, M. Pedersoli, and L. V. Gool, "Face detection without bells and whistles," in ECCV, 2014.
[45] V. Kumar, A. Namboodiri, and C. V. Jawahar, "Visual phrases for exemplar face detection," in ICCV, 2015.
[46] G. Ghiasi and C. C. Fowlkes, "Occlusion coherence: Detecting and localizing occluded faces," CoRR, vol. abs/1506.08347, 2015.
[47] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Aggregate channel features for multi-view face detection," in IJCB, 2014.
[48] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in CVPR, 2015.
[49] S. S. Farfade, M. Saberian, and L.-J. Li, "Multi-view face detection using deep convolutional neural networks," in ICMR, 2015.
[50] S. Liao, A. Jain, and S. Li, "A fast and accurate unconstrained face detector," PAMI, 2015.
[51] N. Markus, M. Frljak, I. S. Pandzic, J. Ahlberg, and R. Forchheimer, "A method for object detection based on pixel intensity comparisons," CoRR, vol. abs/1305.4537, 2013.
[52] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multi-task cascaded convolutional networks," CoRR, vol. abs/1604.02878, 2016.
[53] C. Zhu, Y. Zheng, K. Luu, and M. Savvides, "CMS-RCNN: contextual multi-scale region-based CNN for unconstrained face detection," CoRR, vol. abs/1606.05413, 2016.