© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

arXiv:1501.06180v1 [cs.CV] 25 Jan 2015

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. X, NO. X, MONTH YEAR

Exploring Human Vision Driven Features for Pedestrian Detection

Shanshan Zhang, Christian Bauckhage, Member, IEEE, Dominik A. Klein, and Armin B. Cremers

Abstract: Motivated by the center-surround mechanism in the human visual attention system, we propose to use average contrast maps for the challenge of pedestrian detection in street scenes due to the observation that pedestrians indeed exhibit discriminative contrast texture. Our main contributions are first to design a local, statistical multi-channel descriptor in order to incorporate both color and gradient information. Second, we introduce a multi-direction and multi-scale contrast scheme based on grid-cells in order to integrate expressive local variations. Contributing to the issue of selecting the most discriminative features for assessment and classification, we perform extensive comparisons w.r.t. statistical descriptors, contrast measurements, and scale structures. This way, we obtain reasonable results under various configurations. Empirical findings from applying our optimized detector on the INRIA and Caltech pedestrian datasets show that our features yield state-of-the-art performance in pedestrian detection.

Keywords: center-surround contrast, human vision, channels, multi-direction, multi-scale, pedestrian detection.

Manuscript received MONTH DAY, YEAR; revised MONTH DAY, YEAR.
S. Zhang and D. Klein are with the Department of Computer Science III, University of Bonn, Römerstraße 164, 53117 Bonn, Germany (e-mail: [email protected], [email protected]).
C. Bauckhage is with Fraunhofer IAIS, Schloss Birlinghoven, 53757 Sankt Augustin, Germany (e-mail: [email protected]).
A. B. Cremers is with Bonn-Aachen International Center for Information Technology (B-IT), Dahlmannstraße 2, 53113 Bonn, Germany (e-mail: [email protected]).

I. INTRODUCTION

THE problem of pedestrian detection is attracting growing attention in the computer vision community as it has many practical applications in areas such as video surveillance or driving assistance systems. There is a quickly growing body of work on accurate and efficient detection of pedestrians in image or video data (see [1] for a recent survey). Contributions have been made regarding problems such as feature extraction, classifier design, occlusion handling and the like. Although there were significant improvements over the last decade, one must acknowledge that the precision of state-of-the-art pedestrian detectors still lags behind human vision, which is capable of rapidly localizing pedestrians under various scales, poses, and occlusions even in low quality images. This motivates us to analyze how the human vision system processes incoming stimuli and to devise corresponding novel features for pedestrian detection. In this paper, we present experimental results which show that the use of biologically inspired mechanisms can indeed aid recognition.

In the human visual system, processing of information begins in the retinal tissue immediately after photoreceptive cells (rods for lightness and cones for colors) have transformed incident light into electric signals (cf. Fig. 1). In a first layer of bipolar cells, electrical membrane potentials are locally aggregated. Grouped bipolar cells report to different types of ganglion cells, which convert analog potentials into electric pulse rates. At the transitional synapses between photoreceptive and bipolar cells, but also from bipolar to ganglion cells, there is a lateral wiring of so-called horizontal and amacrine cells, respectively, which modulate the signals to enhance contrasts in a center-surround fashion. It was found that the output of certain ganglion cells agrees with simple difference of Gaussian (DoG) filter responses [3] or more complex oriented Gabor filter results [4]. A more in-depth survey on retinal cell types and their wiring can, for instance, be found in [5].

Fig. 1: Human retinal tissue: (a) Schematic cross-section (image adapted from [2]); (b) Spatial wiring and DoG weighting of retinal ganglion cells.

The center-surround mechanism is also found in later processing stages in the brain where it guides human attention and thus affects how people recognize objects of interest. This psychophysical theory has been widely used in computational approaches to generate saliency maps of the environment [6]. However, while attention is about bottom-up processing, i.e. model-free analysis of signals from the environment, visual search for specific entities requires top-down saliency which tunes the scoring of basic features to the expected appearance.

In this article, we propose to emulate center-surround contrast features motivated by the human visual system and to tune them towards characterizations of the appearance of pedestrians. Our previous findings about human vision driven features were published in [7]; in this paper, we explore the configurations of feature design and achieve better performance. Our contributions are summarized as follows:

Statistical multi-channel cell descriptors: We collect multi-channel information for each cell area, i.e. local image patch, not only regarding lightness and colors, but also w.r.t. gradients which complement each other in recognizing broad variations of clothing or articulations of the human body.
In order to summarize the underlying, unknown distribution of each cell's channel values, we propose two kinds of distributions: (1) a continuous Gaussian distribution, which is the maximum-entropy distribution for a given mean and variance [8]; (2) a bilinear interpolated histogram, which is a representation of frequencies observed over discrete intervals (bins).

Multi-direction and -scale contrast vectors: Aiming at incorporating more specific information between central and surrounding cells, we treat adjacent image regions in different directions individually rather than as a single surrounding region and thus obtain multi-direction contrast descriptors; we compute statistical features at different cell sizes so as to build a contrast pyramid which accords with the general architecture of most visual saliency systems.

Extensive evaluations on various configurations: In order to determine the strongest feature scheme for pedestrian detection, we implement various contrast measurements for both distributions and at different scale structures. In extensive evaluations on the INRIA dataset, we find that the advisable scheme is to use a Gaussian-$W_2$ combination and a 4-6-8-10 scale structure.

Our presentation proceeds as follows: Section II provides an overview of related work; Section III presents details as to our feature extraction mechanism. Two key components of this mechanism, namely statistical descriptors and contrast measures, are explained in Section IV and Section V, respectively. Our classification procedure is presented in Section VI, followed by a discussion of thorough and extensive experiments in Section VII where we compare different feature schemes and state-of-the-art detectors on standard benchmarks. Finally, we conclude and propose several directions for future work in Section VIII.

II. RELATED WORK

Since our focus in this paper is on emulating the center-surround mechanism in human vision in order to design new contrast features for pedestrian detection, the following literature review mainly considers difference based features for pedestrians and center-surround contrast measures used by computational visual attention approaches.

A. Features for pedestrian detection

Most features for pedestrian detection interpret local or global pixel differences in various forms. This is because pixel differences represent texture information which is often characteristic for classes of objects and thus allows for robust classification.

Gradients (vectors of directed derivatives) are popular features as they describe differences w.r.t. intensity or colors between neighboring pixels and allow for characterizing these in terms of magnitudes and orientations. The arguably most popular kind of feature for pedestrian detection, Histograms of Oriented Gradients (HOGs) [9], is indeed built on gradient statistics. HOG features brought about significant improvements and therefore establish an important baseline. Several researchers have extended this feature pool and added further features. For example, Liu et al. [10] introduced the idea of a granularity space, i.e. a family of descriptors ranging from edgelets to HOGs.

Local Binary Pattern (LBP) features [11] are another kind of pixel-wise difference based features which express relative intensity relationships between neighboring pixels by binary codes. Wang et al. [12] combined LBP features with HOG features in order to better cope with occlusions; Ma et al. [13] proposed a set of edge orientation histogram (EOH) and oriented LBP based features to describe cell-level and block-level structure information.

Haar-like features [14], on the other hand, are considered as patch-wise local differences as they compute sums of intensity values over rectangular image regions. Zhang et al. [15] designed Haar-like templates tailored to the upright human body and achieved significant improvement.

Color Self Similarity (CSS) features proposed by Walk et al. [16] describe global differences between pairs of image cells in terms of color histograms. Significant improvements were achieved by combining HOG features and CSS features, since they allow for representing uniform textures found in people's clothing.

Although extensive efforts have been made to interpret local difference in various ways, the performance of all the above features is still far behind humans' capabilities. We therefore argue that it is worthwhile to look into how the human brain processes visual inputs and then to mimic corresponding mechanisms in order to design more representative features.

To our knowledge, the first attempt of designing human vision inspired features dedicated to pedestrian detection can be found in [17]. Although our motivation in this article is a similar one, our features considerably differ from the ones in [17]. The following clear distinctions can be drawn: 1) we consider local difference between central and surrounding square regions rather than between pixels; 2) we compute the center-surround contrasts in multiple channels (not only on colors but also on gradient information); 3) we do not use image channel values directly but describe their distributions using statistical entities; 4) we do not treat neighboring regions as a whole but individually and thus incorporate more detailed information regarding local difference.

B. Center-surround contrast measures

Most computational approaches to visual attention determine center-surround contrasts by DoG-filters or approximations of these [18]. Recently, several researchers represented the central and surrounding areas in terms of feature distributions so as to capture more information about the areas. These distributions were either discrete, e.g. in form of histograms [19], or continuous, e.g. fitted to a normal distribution [20], and various distance measures can be applied between central and surrounding distributions to quantify local contrast.

However, we notice that the above strategies only achieve reasonable results for rather conspicuous scenarios, e.g. a big red flower standing out in surrounding green leaves. In fact, the background in our case is much more complex and the previous contrast models are not guaranteed to perform well. Consequently, we train and evaluate specialized contrast schemes in this paper and aim to find the optimal configuration for our application.

III. OVERVIEW ON FEATURE EXTRACTION

This section introduces the feature extraction procedure we consider in this paper. First of all, we demonstrate that center-surround contrasts are discriminative for pedestrians. We collect 2,416 positive samples (consisting of pedestrians) and 5,000 negative samples (no pedestrians included) from the INRIA pedestrian dataset [9]. All the sample color images are converted to gray images since considering only intensity is sufficient to showcase the possible performance gain. The contrast map of each image is computed on two scales of 4×4 and 6×6 pixels. The contrast value, represented by the difference between central and surrounding cell regions w.r.t. mean value, is added to the central region. Finally, two average contrast maps are generated for the pedestrian class and non-pedestrian class, respectively. In Fig. 2, we see that the average contrast map for pedestrians indeed resembles a human body while the average contrast map for non-pedestrians shows no distinct pattern.

Fig. 2: Heat maps of average center-surround contrasts generated from positive and negative samples of the INRIA pedestrian dataset. Warmer colors indicate higher contrast values. (a) Average contrast map for pedestrians; (b) Average contrast map for non-pedestrian examples. [Panels in each row: color image, intensity, contrast map, average contrast map.]

Based on the above observation, we design our center-surround contrast features for pedestrian detection. An illustration of our feature extraction procedure is shown in Fig. 3. First, we compute multiple channels (e.g. color and gradient information) for each pixel in an image; second, we divide each channel map into square cells of a fixed size and describe each cell using statistical distributions; third, we compute the differences between each cell and its eight nearest neighboring cells so as to obtain a multi-direction contrast vector; finally, we repeat the second and third step along each channel with different cell sizes and thus obtain a multi-channel, multi-direction, and multi-scale contrast pyramid for the whole image.

Fig. 3: Flow chart of our feature extraction mechanism. Here, we consider a three-scale structure as an example but different scale structures can be used as well. [Stages shown: input RGB image, basic channels, rectangular cells, rectangular distributions, multi-channel and -direction center-surround contrast maps, feature pyramid over scale.]

A. Center-surround contrasts

The core part of our feature extraction is how to determine the difference between two rectangular regions. To address this problem, we first choose an appropriate distribution for each region (see Section IV) and then consider corresponding contrast measures to numerically describe the difference between two given distributions (see Section V). In order to determine the strongest center-surround contrast features for pedestrian detection, we conduct extensive experiments and comprehensive comparisons on various combinations of distributions and contrast measures. Experimental results under different schemes are presented in Section VII.

B. Channels

Our use of multiple feature channels is motivated by the success of Dollár's detector [ChnFtrs] [21], which has been established as a strong baseline due to its accuracy and efficiency. In [ChnFtrs], multiple channel maps, in terms of colors and gradients, are computed for each input image, and the final feature values consist of local sums at different spatial locations and over all channel maps. These local sums are efficient to compute by employing integral images. They are less sensitive to noise than the individual channel values.

Similar to [ChnFtrs], we also consider a total of 10 different channels: 3 channels for LUV colors, 1 channel for gradient magnitude information, and 6 channels for histograms of oriented gradients. Note that all the above channels are computed pixel by pixel. Histograms of oriented gradients are usually computed for a group of pixels inside an image region, but we compute them pixelwise, which is to say we simply quantize the gradient magnitudes into orientation bins. For each pixel, two neighboring bins are affected as we employ bilinear interpolation w.r.t. orientation bins, see Fig. 4 for an illustration.

Fig. 4: Illustration of a histogram of oriented gradients obtained for a single pixel.

Prior to channel computation, input images are smoothed with a binomial filter [22] of radius 1, i.e. σ ≈ 0.87, in order to remove noise. Note that we explicitly do not smooth channel data as we observed this to lead to decreased performance.

C. Center-surround neighborhood patterns

Here, we present details on our design of center-surround cell pairs. Four patterns are proposed in this paper and explained in the following.

$C_1S_8$ pattern: For each cell, its eight nearest neighboring cells are considered as surrounding cells, denoted as $[C_1^s, C_2^s, \ldots, C_8^s]$. The eight surrounding cells can be treated either as a whole or separately. From our experiments, we find a significantly better performance if they are treated individually (cf. Fig. 10), since the difference information in each of the eight directions is then integrated separately. Thus, we use this $C_1S_8$ pattern in our experiments to build a multi-direction contrast vector for each cell along every channel.

Sparse pattern: Significant redundancy will emerge if we consider eight nearest neighboring cells for each cell, because each adjacent pair of cells is incorporated twice. To avoid this redundancy, we use a cell step of 2 cells along both horizontal and vertical directions, resulting in a sparse neighborhood map as shown in Fig. 5a.

Shift pattern: Following the Nyquist-Shannon sampling theorem [23], we propose a shift mechanism where we define two cell layers and iterate the $C_1S_8$ center-surround pattern on each respectively. For the first layer, we start from the top left pixel and divide the whole image into square cells; for the second layer, the starting point is shifted with a step of 0.5 times the cell size along both horizontal and vertical directions and we then divide the image patch from the new starting point into square cells with the same cell size. An illustration of the shift mechanism is shown in Fig. 5b.

Multi-scale pattern: Finally, in order to describe contrasts at different scales, we use different cell sizes to build a contrast pyramid which is in accordance with the general architecture of most computational visual attention systems.

Fig. 5: Illustration of two neighborhood patterns. (a) Sparse pattern. Each red arrow points from the central cell to one of its neighboring cells. (b) Shift pattern. Two layers of cells are denoted with green and blue grid lines and the red cells denote central cells with eight nearest neighboring cells.

IV. STATISTICAL CELL DESCRIPTORS

In order to assess the underlying distribution inside each cell, we estimate both continuous and discrete statistical approximations: (1) a Gaussian distribution, which is the type of continuous distribution with maximum entropy given a known mean and variance; (2) a bilinear interpolated histogram, which is a representation of frequencies, determined for discrete intervals (bins).

In our following discussion, we assume that we have measured values of channel $i$ for the whole input image. We denote this data as channel image $P^i$ and consider a specific cell $c$ with its channel vector $P_c^i = [v_1^i, v_2^i, \ldots, v_p^i]$.

A. Gaussian distributions

The true distribution of channel values for local image patches is unknown, but is modeled as Gaussian in this section. This assumption is made not only because normality makes further estimations convenient to solve, but also due to its popularity in classic low-level vision models, for example, in [24].

For numerical description, we apply maximum likelihood (ML) estimation of the parameters and obtain mean and variance values as

$$\hat{\mu}_c^i = \frac{1}{p}\sum_{k=1}^{p} v_k^i = \overline{P_c^i}, \tag{1}$$

and

$$\hat{\Sigma}_c^i = \frac{1}{p}\sum_{k=1}^{p}\left(v_k^i - \overline{P_c^i}\right)^2 = \overline{(P_c^i)^2} - \left(\overline{P_c^i}\right)^2. \tag{2}$$

Now the estimation is narrowed down to computing two local averages, $\overline{P_c^i}$ and $\overline{(P_c^i)^2}$, according to Eq. 1 and Eq. 2. For efficiency, we employ two integral images for each channel: one for the original channel image $P^i$ and the other for the squared channel image $(P^i)^2$, and thus avoid extensive summations per individual cell.

Once the parameters of the Gaussian have been determined, we obtain a descriptor for cell $c$ for channel $i$:

$$D^i(c) = [\mu_c^i, \Sigma_c^i]. \tag{3}$$

B. Histograms

Histograms are a reasonable discrete representation of distributions without any prior assumption of the underlying statistics. They count the observed frequencies of data appearing in discrete intervals. The advantage of using a histogram is furthermore that it tolerates noise and minor intra-class differences and that its degree of tolerance can be adjusted by choosing appropriate numbers of bins. Generally, using a smaller number of bins results in a coarser description of the original data, and vice versa.

It is computationally expensive to naively compute histograms for all cells and all sizes, so we employ integral histograms [25]. An integral histogram can be considered as a stack of integral images each counting the sums of values to the top and left from a pixel that fall into a certain histogram bin.

To eliminate bias, we implement bilinear interpolation for histograms, i.e. each value contributes to its two nearest bins with a weight relating to the distance between the given value and the bin center. Also, normalization is rather important for histograms, since it eliminates the effect of data magnitudes. In this paper, we normalize each local histogram for each cell and channel so that it sums up to 1. In the end, given $b$ bins, we obtain a histogram $H_c^i$ as a descriptor vector for channel vector $P_c^i$:

$$H_c^i = [h_c^i(1), h_c^i(2), \ldots, h_c^i(b)], \qquad \sum_{k=1}^{b} h_c^i(k) = 1. \tag{4}$$

V. CONTRAST MEASUREMENTS

Aiming for the strongest center-surround contrast features, we introduce multiple contrast measurements for each distribution descriptor to make a comprehensive comparison in this section. Combining a distribution descriptor and a corresponding measurement forms a specific scheme for feature extraction. We note that the cell descriptors introduced above are statistical distributions whose comparison requires care. Although the Euclidean distance is often used in practice, it is not truly faithful to the nature of this kind of data. In particular, when comparing distributions or histograms, we are dealing with compositional data [26]. This is to say that, for a normalized histogram $H = [h(1), \ldots, h(b)]$ of $b$ bins, there are only $b-1$ degrees of freedom, since the value of an arbitrary bin $h(i)$ is determined by $1 - \sum_{k \neq i}^{b} h(k)$. It is therefore impossible to perturb one bin of a histogram without affecting the others. This has implications for similarity measurements that are not accounted for by the Euclidean metric. However, there are several distance or similarity measures that cope with these characteristics and we consider their use in our context. To summarize, we explore six different metrics in this paper: Gaussian-$W_2$ distance, Gaussian-$L_2$ distance, Gaussian gradient matrices, histogram Kullback-Leibler divergences, histogram Hellinger distances, and histogram intersections.

In the following, we denote the channel distributions for a central and a surrounding cell as $P_c^i$ and $P_s^i$, respectively. The contrast vector $\overrightarrow{cst}(P_c^i, P_s^i)$ is computed using different measures.

TABLE I: Illustration of feature size under different configurations.

  Scales         4-6                                  4-6-8                                4-6-8-10
  Feature size   $20{,}320\,D(\overrightarrow{cst})$   $23{,}440\,D(\overrightarrow{cst})$   $25{,}040\,D(\overrightarrow{cst})$

A. Gaussian distributions

We introduce three different contrast measures to compute the difference between two cells' channel distributions, each represented by the Gaussian descriptor in Eq. 3. We compare the results of those three measures in Section VII.

1) $W_2$ distance: The $W_2$ distance (2nd Wasserstein distance) was first introduced as a measure for center-surround contrast by Klein et al. [20] and achieved reasonable results for saliency detection. Its definition in our case can be written as:

$$W_2(P_c^i, P_s^i) = \left[\inf_{\gamma \in \Gamma(P_c^i, P_s^i)} \int_{\mathbb{R}\times\mathbb{R}} |x-y|^2 \, d\gamma(x,y)\right]^{\frac{1}{2}}, \tag{5}$$

where $\Gamma(P_c^i, P_s^i)$ denotes the set of all couplings of $P_c^i$ and $P_s^i$.

It would be intractable to compute the integral in Eq. 5 in case of arbitrary distributions. However, for the Gaussian distribution, it can be solved analytically [27]. The contrast vector between one central cell distribution $P_c^i \sim \mathcal{N}(\mu_c^i, \Sigma_c^i)$ and its neighboring cell distribution $P_s^i \sim \mathcal{N}(\mu_s^i, \Sigma_s^i)$ of channel $i$ indeed amounts to:

$$W_2(P_c^i, P_s^i) = \left[\,\|\mu_c^i - \mu_s^i\|_2^2 + \Sigma_c^i + \Sigma_s^i - 2\sqrt{\Sigma_c^i \Sigma_s^i}\,\right]^{\frac{1}{2}}. \tag{6}$$
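As an illustration of Eqs. 1, 2 and 6, the following minimal Python sketch estimates the Gaussian descriptor of a square cell from two integral images and then evaluates the closed-form $W_2$ distance between two cells. It is not the implementation used for the experiments in this paper: the function names, the list-of-lists image layout and the toy channel values are assumptions made for this example only.

```python
import math


def integral_image(img):
    """Summed-area table with one extra zero row and column (hypothetical helper)."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, size):
    """Sum over the square cell whose top-left pixel is (x0, y0)."""
    x1, y1 = x0 + size, y0 + size
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def gaussian_descriptor(ii, ii_sq, x0, y0, size):
    """Eqs. 1-3: ML mean and variance of one cell from two integral images."""
    n = size * size
    mean = box_sum(ii, x0, y0, size) / n
    mean_sq = box_sum(ii_sq, x0, y0, size) / n
    var = max(mean_sq - mean * mean, 0.0)  # clamp tiny negative round-off
    return mean, var


def w2_distance(center, surround):
    """Eq. 6 for two one-dimensional Gaussians given as (mean, variance)."""
    (mu_c, var_c), (mu_s, var_s) = center, surround
    sq = (mu_c - mu_s) ** 2 + var_c + var_s - 2.0 * math.sqrt(var_c * var_s)
    return math.sqrt(max(sq, 0.0))


# Toy 4x4 channel split into 2x2 cells (values are illustrative only).
channel = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [1, 3, 2, 2],
    [3, 1, 2, 2],
]
ii = integral_image(channel)
ii_sq = integral_image([[v * v for v in row] for row in channel])

center = gaussian_descriptor(ii, ii_sq, 0, 0, 2)  # top-left cell
right = gaussian_descriptor(ii, ii_sq, 2, 0, 2)   # its right neighbor
below = gaussian_descriptor(ii, ii_sq, 0, 2, 2)   # its lower neighbor

# One contrast value per direction, as in a multi-direction contrast vector.
contrast_vector = [w2_distance(center, right), w2_distance(center, below)]
```

Because both averages come from integral images, each cell descriptor costs a constant number of look-ups regardless of the cell size, which is the efficiency argument made for Eqs. 1 and 2.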
All the contrast measurements used in this paper are We note that Wasserstein distances are natural metrics for one dimensional, except SGrd, which is two dimensional. the comparison of two probability distributions where one distribution is derived from the other one through small, non- uniform perturbations; in the computer vision literature, the 2)Hellinger distance: Let P and Q be two probability discretizedWassersteindistanceisalsoreferredtoastheEarth distributions with respect to a probability measure λ; the Mover’s distance [28]. Hellinger distance is a measure of their difference that is 2)L2 distance: If we treat the two-dimensional descriptors independent of λ. The square of the Hellinger distance has for the central and surrounding cells as two 2D points, then a particularly simple form and is defined as [30]: the L2 distance between (µi,Σi) and (µi,Σi) amounts to: c c s s (cid:112) 1(cid:90) (cid:18)(cid:114)dP (cid:114)dQ(cid:19)2 D (Pi,Pi)= (µi −µi)2+(Σi −Σi)2. (7) H2(P,Q)= − dλ. (11) L2 c s c s c s 2 dλ dλ 3)Signed gradient matrix (SGrd): For each center- For two discrete probability distributions Hi and Hi that surround cell pair, we compute the signed gradient matrix for represent Pi and Pi, the Hellinger distance is tchen comsputed the mean and variance vector [µi,Σi], resulting in a contrast as the contrcast betwseen Pi and Pi: vector. The contrast vector between one central cell distri- c s Pbusiti∼onNP(ciµis∼,ΣNis)(µofic,cΣhaicn)naenldiciatsnntheeignhbbeoerixnpgrecseslelddaisstfroiblluotwiosn: H2(Hi,Hi)= √1 (cid:118)(cid:117)(cid:117)(cid:116)(cid:88)b (cid:18)(cid:112)hi(k)−(cid:112)hi(k)(cid:19)2. (12) c s c s 2 −−−→ (cid:20) (cid:21) k=1 SGrd(Pi,Pi)= µi −µi,Σi −Σi . (8) c s c s c s 3)Histogram intersection: The histogram intersection is another popular similarity measure for histograms. Given two In the feature space, the contrast vector in Eq. 
8 is treated histograms H and H with n bins, it is defined as: p q in terms of two separate values which enables a convenient n training procedure. (cid:80) min(H (k),H (k)) p q HI(H ,H )= k=1 . (13) p q n (cid:80) H (k) B. Histograms p k=1 We consider three different distance measures which are As all histograms considered in this paper are normalized commonly used for histograms. In the following, the his- so that they sum up to 1, the histogram intersection between tograms for a central and a surrounding cell w.r.t. channel i Hi and Hi can be further simplified to: are denoted as in Eq. 4. We compare the results of the three c s measurements in Section VII. b (cid:88) 1)Kullback-Leibler divergence: Using information theo- HI(Hi,Hi)= min(hi(k),hi(k)). (14) c s c s retic arguments, one can represent the difference between a k=1 center and a surround cell using the Kullback-Leibler Diver- gence (KLD) [29], VI. CLASSIFICATION (cid:90) ∞ p(x) In this section, we discuss our approach towards classifica- DKL(P||Q)= p(x)lnq(x)dx. (9) tion of the center-surround features introduced above. First of −∞ all,weaddressthesizeofourfeaturepool.Givenapedestrian Thus, the KLD between two probability distributions P and model of 60 × 120 pixels, Tab. I compares feature sizes Q is a relative entropy that indicates the loss in information under different settings in terms of scales and dimensions of if P is approximated by Q. The more P differs from Q, the contrastvectors.Apparently,thefeaturepoolgrowsoncemore higher the KLD. scales are employed. Among all the contrast measurements Given the histograms Hi and Hi of two channel vectors considered in this paper, only the signed gradient matrix is c s Pi and Pi, we calculate the discrete KLD as our first contrast two dimensional, while all others are one dimensional. 
measure:

$$ D_{KL}(H_c^i \,\|\, H_s^i) = \sum_{k=1}^{b} \ln\!\left(\frac{h_c^i(k)}{h_s^i(k)}\right) h_c^i(k). \tag{10} $$

To efficiently train classifiers on such a large feature pool, we employ a fast version of AdaBoost [31], since it offers a convenient and fast approach to feature selection from a large number of candidate features. The feature selection procedure is conducted for each feature configuration individually.

For boosting algorithms, one should choose proper weak classifiers so as to build the final strong classifier. We use decision trees of depth 2 as our weak classifiers since they are efficient to learn. Another important parameter is the number of weak classifiers, which, after extensive experimentation, we choose to be 4096, as we observe that more weak classifiers do not lead to gains in performance. Similar to classic approaches to pedestrian detection [9], [21], we also employ a multi-round training strategy, which has been shown to lead to better performance than a simple one-round training procedure with the same number of samples. For the first round, the initial negative training samples are randomly cropped from the negative example images; in the following rounds, hard negative samples are exhaustively searched over all negative example images using the classifier built in the previous round. This procedure is iterated until no significant performance gains are observed with further retraining. In our experiments, three rounds of retraining yield optimal performance; additional rounds did not show significant improvements. We collect 5000 negative samples at each round, so the initial round plus three retraining rounds result in a large negative sample pool of 20,000 image patches after four rounds.

In order to look into which local feature regions are more informative, we plot a weight image of the top 100 feature positions with the highest weights from the final strong classifier, as shown in Fig. 6a. To generate this map, we add the weight of each selected feature to the cells it covers and use different colors to indicate the accumulated weight of each cell after boosting. As expected, the head-shoulder area of the human body proves to be more discriminative for pedestrian detection than other body parts. Moreover, we also add up the weights of the features separately for each channel to indicate which channels are more representative, and use bars to illustrate the accumulated weight of each channel, as shown in Fig. 6b. We find that all the channels we chose contribute rather evenly to the final classifier, indicating no channel redundancy.

[Figure 6 image omitted. Panel (b) bar labels and values: GO6 34.92, GO5 26.71, GO4 27.53, GO3 28.82, GO2 30.38, GO1 32.58, GM 27.98, V 27.51, U 34.23, L 34.80.]
Fig. 6: Illustration of representative features under one configuration. (a) Body parts weight map: different colors are used to indicate the accumulative weight of each pixel after boosting. (b) Channel weight bars: accumulative weight of each channel is indicated by one bar.

The most discriminative features selected by the boosting algorithm are then used for pedestrian detection in still images. Our pedestrian model is of size 60×120 pixels, and we resize the input image to detect pedestrians at different scales. To this end, we slide a window over the whole image and consider multiple scales. The spatial step size is set identical to the cell size for speed reasons, and the scale step is set to 1.09 (approximately $2^{1/8}$) so that there are 8 scales per octave. We use a simplified non-maximal suppression (NMS) procedure [21] to suppress nearby detections.

VII. EXPERIMENTS

In this section, we introduce the benchmark datasets and evaluation protocols used in our experiments, provide comprehensive comparisons for different feature schemes, and compare our best detector configuration with state-of-the-art detectors.

A. Benchmark datasets

Experiments are conducted on two public benchmark datasets: the INRIA Person Dataset [9] and the Caltech Pedestrian Detection Benchmark [1]. A comparison of the above two datasets is given in Tab. II.

                                INRIA [9]   Caltech [32]
Properties   imaging setup      photo       mobile
             color images       √           √
             video seqs.        ×           √
             occlusion labels   ×           √
Training     # pedestrians      1208        192k
             # pos. images      614         67k
             # neg. images      1218        61k
Testing      # pedestrians      566         155k
             # pos. images      288         65k
             # neg. images      453         56k

TABLE II: Statistics of two pedestrian datasets used for experiments [1].

INRIA Person Dataset: This is arguably the most popular dataset for people detection and comes with pre-defined subsets for training and testing. In the training set, there are 2416 positive samples, obtained by mirroring 1208 pedestrian images, and 1218 natural images that contain no pedestrians, so that negative samples can be randomly generated by cropping subregions. In the test set, there are 288 positive images containing 566 pedestrian annotations.

Caltech Pedestrian Detection Benchmark: This is currently the largest and most challenging dataset for pedestrian detection, consisting of approximately 10 hours of 640×480 30 Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated. The training data (set00-set05) and the test data (set06-set10) consist of approximately 192,000 and 155,000 pedestrian annotations, respectively.

B. Evaluation protocol

In the following, we explain details of our evaluation protocol in four aspects, which are consistent with the conventions in this field [1].
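As a rough illustration of two conventions in this protocol, the overlap criterion for matching bounding boxes and the average miss rate summary, consider the following minimal sketch. All function names, the (x, y, width, height) box layout, and the rule for reading a miss rate off the curve at each reference FPPI value are assumptions made for illustration; they are not taken from the evaluation code of [1].

```python
from typing import List, Optional, Sequence, Tuple

# Assumed box layout: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


def overlap_ratio(dt: Box, gt: Box) -> float:
    """Area of intersection divided by area of union of two boxes."""
    ix = max(0.0, min(dt[0] + dt[2], gt[0] + gt[2]) - max(dt[0], gt[0]))
    iy = max(0.0, min(dt[1] + dt[3], gt[1] + gt[3]) - max(dt[1], gt[1]))
    inter = ix * iy
    union = dt[2] * dt[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def is_match(dt: Box, gt: Box, threshold: float = 0.5) -> bool:
    """A detection matches a ground truth box if the overlap ratio exceeds 0.5."""
    return overlap_ratio(dt, gt) > threshold


def reference_fppi(num: int = 9, lo_exp: float = -2.0, hi_exp: float = 0.0) -> List[float]:
    """Nine FPPI values evenly spaced in log-space between 10^-2 and 10^0."""
    step = (hi_exp - lo_exp) / (num - 1)
    return [10.0 ** (lo_exp + i * step) for i in range(num)]


def average_miss_rate(fppi: Sequence[float], miss_rate: Sequence[float],
                      refs: Optional[Sequence[float]] = None) -> float:
    """Arithmetic mean of the miss rates sampled at the reference FPPI values.

    Assumption: at each reference value we take the lowest miss rate reached
    at an FPPI not exceeding it, and count 1.0 if the curve never gets there.
    """
    if refs is None:
        refs = reference_fppi()
    samples = []
    for ref in refs:
        reached = [m for f, m in zip(fppi, miss_rate) if f <= ref]
        samples.append(min(reached) if reached else 1.0)
    return sum(samples) / len(samples)
```

For example, two 10×10 boxes shifted by 5 pixels horizontally overlap with ratio 50/150 and therefore do not match under the 0.5 threshold.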
1) Ground truth filtering: In our experiments, we consider a reasonable subset of all ground truth data, namely pedestrians that are more than 50 pixels in height and more than 65% visible. Outliers are marked with an ignore label, which means that they need not be matched; however, matches to them are not counted as mistakes either.

2) Detection results filtering: We filter detection results using an expanded filtering method [1], so that detection results far outside the evaluation scale range are not considered. In this paper, we evaluate a scale range of $[50, +\infty)$, and only detections in $[50/\xi, +\infty)$ are considered for evaluation. In our experiments, we set $\xi = 1.25$ [1].

3) Bounding box matching rules: A filtered ground truth bounding box and a detection bounding box are denoted by $B_{gt}$ and $B_{dt}$, respectively. $B_{gt}$ and $B_{dt}$ match if and only if the ratio of the area of their overlap to the area of their union exceeds a given threshold [1]:

$$ \mathrm{match}(B_{dt}, B_{gt}) = \frac{\mathrm{area}(B_{dt} \cap B_{gt})}{\mathrm{area}(B_{dt} \cup B_{gt})} \overset{!}{>} 0.5 . \tag{15} $$

[Figure 7 image omitted. Both panels plot miss rate against false positives per image. Panel (a) Gaussian distributions, legend: 17.33% L2, 16.67% SGrd, 16.44% W2. Panel (b) Histograms, legend: 22.23% HI, 19.72% KLD, 19.22% Hellinger.]
Fig. 7: Experiments on different contrast measures for two cell descriptors.

4) Performance measurements: We perform evaluation w.r.t. full images instead of detection windows, as the former provides a natural measure of error of an overall detection system. In this paper, we employ two measurements to compare performance among different detectors. First, we plot miss rate against false positives per image (FPPI) in logarithmic scales by varying the threshold on the detection confidence of the classifiers. In addition to these miss rate vs. FPPI curves, we calculate a single numerical measurement to summarize each detector's performance. We use the average miss rate [1], which is computed by averaging the miss rates
at nine FPPI rates evenly sampled in log-space in the range of $[10^{-2}, 10^{0}]$. This average miss rate generally gives a more stable and informative assessment of the overall performance for different detectors than the miss rate at only $10^{-1}$ FPPI [1].

C. Comparisons for different feature settings

In this section, we seek the strongest feature scheme through experiments under different settings on the INRIA dataset. First, we define a default setting with three scales of 4×4, 6×6, and 8×8 pixels, and 5 histogram bins when histograms are used. In the following, we compare different descriptors, contrast measurements, scale structures, and numbers of histogram bins where histograms are used.

1) Contrast measurements: We investigate different contrast measures for each of the two descriptors. From Fig. 7, we see that both descriptors produce stable results under different contrast measures. Despite this stable performance, we observe slight differences between the contrast measures. For Gaussian descriptors, the L2 distance performs worst, while the W2 distance and SGrd produce comparably better results. For histograms, HI is the worst measure, while KLD and Hellinger are comparably better. Therefore, we select Gaussian-W2 and Hist-Hellinger as the two preferable combinations, which produce the best results for Gaussian and histogram descriptors, respectively.

2) Number of histogram bins: The number of bins is an important parameter in practical applications of histograms. Not surprisingly, we observe performance changes when increasing the number of histogram bins. Fig. 8 shows experimental results when using 5, 10, 15 and 20 bins with the Hellinger distance, which has been shown to be the best among all the contrast measures for histograms. Generally, more histogram bins capture more accurate information about the local cell region, thus leading to better performance. If we increase the number of histogram bins from 5 to 15, we obtain better results, as expected. However, performance begins to decrease again when we consider more than 20 bins, since these settings are more error prone under noisy real world data. Note that, in the following experiments, we thus use 15-bin histograms instead of the default 5-bin histograms.

[Figure 8 image omitted. Average miss rate plotted against the number of histogram bins (5, 10, 15, 20).]
Fig. 8: Experiments with different histogram bins using the Hellinger distance.

[Figure 9 image omitted. Miss rate against false positives per image. Legend: 22.18% ChnFtrs [baseline], 17.83% Hist (15 bins)-Hellinger, 16.44% Gaussian-W2.]
Fig. 9: Comparison of two optimal descriptor-measurement combinations and the baseline detector [ChnFtrs].

[Figure 10 image omitted. Average miss rate of the C1S1 and C1S8 patterns for Gaussian-W2, Gaussian-L2, Gaussian-SGrd, Hist-KLD, Hist-Hellinger, and Hist-HI.]
Fig. 10: Comparison of two center-surround patterns.

[Figure 11 image omitted. Average miss rate of the scale structures 4-6, 4-6-8, and 4-6-8-10 for Gaussian-W2, Gaussian-L2, Gaussian-SGrd, Hist-KLD, Hist-Hellinger, and Hist-HI.]
Fig. 11: Comparison of three scale structures. In this experiment, 15-bin histograms are used.

3) Descriptors: From Fig. 9, we can see that both optimal combinations outperform the baseline detector [ChnFtrs], which illustrates the effectiveness of our new features. Between the two optimal combinations, Gaussian-W2 produces better overall results than Hist (15 bins)-Hellinger. Therefore, Gaussian-W2 is selected as the optimal descriptor-contrast-measurement combination in this paper.

4) $C_1S_8$ pattern vs. $C_1S_1$ pattern: We proposed the $C_1S_8$ pattern in Section III in order to incorporate more information about local image differences. Here, we compare the performance of both patterns to show why the directed $C_1S_8$ pattern is superior. From Fig. 10, it appears that the $C_1S_8$ pattern produces better results than $C_1S_1$ over all descriptor-contrast-measurement combinations.

5) Scale structures: Generally, the use of more scales incorporates richer information and leads to better performance. In this paper, we consider three different scale structures, 4-6, 4-6-8, and 4-6-8-10, and show their comparison in Fig. 11. Increasing the scales from 4-6 to 4-6-8 brings about a significant improvement of approximately 5% w.r.t. miss rate; on the other hand, continuing to increase scales to 4-6-8-10 produces a less prominent performance gain of less than 1% w.r.t. miss rate. Therefore, we choose a scale structure of 4-6-8-10 as our optimal choice.

In summary, the optimal feature setting is to use the combination of Gaussian-W2 and the scale structure 4-6-8-10. We use this configuration in the following experiments.

D. Computational complexity

We investigate the computational complexity of different feature settings. Our normal-distribution based as well as histogram based features are computed from local averages of certain values. Such local features can be computed in O(n) time, with n denoting the number of image pixels, using moving averages or integral image techniques. They only differ in the number of layers needed (one for each distribution parameter or bin), which amounts to a constant factor. Looking into the details of the diverse distance functions implemented for different feature settings, we see the very same effect: the time complexity is constant per pixel (growing linearly with image size), so the overall complexity for each setting is still O(n). Note that the constant factor for normal distributions is 2 per input channel, while histograms require one layer per bin, i.e. b ≥ 2 (e.g. 15) layers per input channel.
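The O(n) argument can be illustrated with a summed-area table (integral image): each layer is built in one pass over the pixels, after which the average over any square cell costs four lookups. The sketch below is a minimal pure-Python illustration, not the implementation used in the paper. The function names are hypothetical, and it assumes that the two layers per channel for the normal-distribution descriptor hold the pixel values and their squares, from which a cell mean and variance follow.

```python
from typing import List, Tuple

Image = List[List[float]]


def integral_image(img: Image) -> Image:
    """Summed-area table with an extra zero row and column, built in one pass."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def cell_mean(sat: Image, top: int, left: int, size: int) -> float:
    """Mean over the size x size cell with top-left pixel (top, left): four lookups."""
    bottom, right = top + size, left + size
    total = sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]
    return total / (size * size)


def build_gaussian_layers(channel: Image) -> Tuple[Image, Image]:
    """Two layers per channel (assumed): sums of values and sums of squared values."""
    squared = [[v * v for v in row] for row in channel]
    return integral_image(channel), integral_image(squared)


def cell_mean_var(layers: Tuple[Image, Image], top: int, left: int,
                  size: int) -> Tuple[float, float]:
    """Mean and variance of one cell in constant time from the two layers."""
    mean = cell_mean(layers[0], top, left, size)
    mean_sq = cell_mean(layers[1], top, left, size)
    return mean, max(mean_sq - mean * mean, 0.0)
```

Under the same assumption, a histogram descriptor with b bins would use b such layers, one per bin indicator image, which is the constant factor b mentioned above.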

