FOR REVIEW: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Learning And-Or Model to Represent Context and Occlusion for Car Detection and Viewpoint Estimation

Tianfu Wu*, Bo Li* and Song-Chun Zhu

arXiv:1501.07359v2 [cs.CV] 27 Sep 2015

Abstract—This paper presents a method for learning an And-Or model to represent context and occlusion for car detection and viewpoint estimation. The learned And-Or model represents car-to-car context and occlusion configurations at three levels: (i) spatially-aligned cars, (ii) a single car under different occlusion configurations, and (iii) a small number of parts. The And-Or model embeds a grammar for representing large structural and appearance variations in a reconfigurable hierarchy. The learning process consists of two stages in a weakly supervised way (i.e., only bounding boxes of single cars are annotated). Firstly, the structure of the And-Or model is learned with three components: (a) mining multi-car contextual patterns based on layouts of annotated single car bounding boxes, (b) mining occlusion configurations between single cars, and (c) learning different combinations of part visibility based on CAD simulations. The And-Or model is organized in a directed and acyclic graph which can be inferred by Dynamic Programming. Secondly, the model parameters (for appearance, deformation and bias) are jointly trained using Weak-Label Structural SVM. In experiments, we test our model on four car detection datasets—the KITTI dataset [1], the PASCAL VOC2007 car dataset [2], and two self-collected car datasets, namely the Street-Parking car dataset and the Parking-Lot car dataset—and three datasets for car viewpoint estimation—the PASCAL VOC2006 car dataset [2], the 3D car dataset [3], and the PASCAL3D+ car dataset [4]. Compared with state-of-the-art variants of deformable part-based models and other methods, our model achieves significant improvement consistently on the four detection datasets, and comparable performance on car viewpoint estimation.

Index Terms—Car Detection, Car Viewpoint Estimation, And-Or Graph, Hierarchical Model, Context, Occlusion Modeling.

1 INTRODUCTION

1.1 Motivation and Objective

Cars are among the most frequently seen object categories in everyday scenes. Car detection and viewpoint estimation by a computer vision system have broad applications such as autonomous driving and parking management. Fig. 1 shows a few examples with varying complexities in car detection from four datasets. Car detection and viewpoint estimation are challenging problems due to the large structural and appearance variations, especially the ubiquitous occlusions which further increase the intra-class variations significantly. In this paper, we are interested in learning a unified model which can detect cars in the four datasets and estimate car viewpoints. We aim to address two main issues in the following.

[Figure 1]
Fig. 1. Illustration of varying complexities in car detection from four datasets. (a) The PASCAL VOC2007 car dataset [2] consists of single cars under different viewpoints but with less occlusion, as pointed out in [5]. (b) The KITTI car benchmark [1] includes on-road cars captured by a camera mounted upon a driving car, which have more occlusions but restricted viewpoints. (c) The Street-Parking car dataset [6] includes cars with heavy occlusions but less multi-car context, and (d) the Parking-Lot car dataset [7] consists of cars with heavy occlusions and rich multi-car context. The proposed And-Or model is learned for car detection in all four datasets.

The first is to explicitly represent occlusion. Occlusion is a critical aspect in object detection for several reasons: (i) we do not know ahead of time what portion of an object (e.g., a car) will be visible in a test image; (ii) we also do not know the occluded areas in weakly-labeled training data (i.e., only bounding boxes of single cars are given, as considered in this paper); and (iii) object occlusions in testing data could be very different from those in training data. Handling occlusions entails models capable of capturing the underlying regularities of occlusions at the part level (i.e., different occlusion configurations).

The second is to explicitly exploit contextual information co-occurring with occlusions (see examples in Fig. 1 (b), (c) and (d)), which goes beyond single-car detection. We focus on car-to-car contextual patterns (e.g., different multi-car configurations such as 2, 3 or 4 cars), which will be utilized in detection and viewpoint estimation and naturally integrated with occlusion configurations.

- T.F. Wu is with the Department of Statistics, University of California, Los Angeles. E-mail: [email protected]
- B. Li is with Beijing Lab of Intelligent Information Technology, Beijing Institute of Technology, China and a visiting student at University of California, Los Angeles. E-mail: [email protected]
- S.-C. Zhu is with the Department of Statistics and Computer Science, University of California, Los Angeles. E-mail: [email protected]
- *Joint first authors.
Manuscript received MM DD, YYYY; revised MM DD, YYYY.

[Figure 2]
Fig. 2. Illustration of the statistical regularities of car occlusions and multi-car contextual patterns by CAD simulation. We represent car-to-car occlusion at the semantic part level (left) and generate a large number of synthetic occlusion configurations (middle) w.r.t. four factors (car type, orientation, relative position and camera view). We represent the regularities of different combinations of part visibilities (i.e., occlusion configurations) by a hierarchical And-Or model. This model also represents multi-car contextual patterns (right) based on the geometric configurations of single cars.

To represent both occlusion and context, we propose to learn an And-Or model which takes into account structural and appearance variations at the multi-car, single-car and part levels jointly. Our And-Or model belongs to the grammar models [8], [9] embedded in a hierarchical graph structure, which can express a large number of configurations (occlusion configurations and multi-car configurations) in a compositional and reconfigurable manner. Fig. 3 illustrates our And-Or model. By reconfigurable, we mean that we learn appearance templates and deformation models for single cars and parts, and the composed appearance template for a multi-car contextual pattern is inferred on-the-fly in detection according to the selections of its child single-car Or-nodes. So, our model can express a large number of multi-car contextual patterns with different compatible occlusion configurations of single cars. Reconfigurability is one of the most desirable properties in hierarchical models; it plays the main role in boosting the performance in our experiments, and also distinguishes the proposed method from other models such as the visual phrase model [10] and different object-pair models [11], [12], [13], [14].
1.2 Method Overview

1.2.1 Data Preparation with Simulation Study

Manually annotating car views, parts and part occlusions on real images is time-consuming and usually error-prone. One innovation in this paper is that we generate a large set of occlusion configurations and multi-car configurations using CAD models¹ and a publicly available graphics rendering engine, the SketchUp SDK². In the CAD simulation, the occlusion configurations and multi-car contextual patterns reflect variations in four factors: car type, orientation, relative position and camera view. We decompose a car into 17 semantic parts, shown in different colors in the left side of Fig. 2. We then generate a large number of examples by placing 3 cars in a 3×3 grid (resembling the regularities of cars in parking lots or on the road; see the middle of Fig. 2). For the cars in the center, we compare their part visibilities from different viewpoints (as illustrated by the camera icons), and obtain the part occlusion data matrix (each row represents an example, and each entry takes a binary value, 0/1, representing whether a part is occluded under a viewpoint). The data matrix is used to learn the occlusion configurations. Similarly, we learn different multi-car contextual patterns based on the geometric configurations (see some examples in the right side of Fig. 2). Note that the semantic part annotations in the synthetic examples are used only to learn the structure of our And-Or model; the parts are treated as latent variables in the weakly-annotated training data of real images. We do not evaluate the performance of part localization; instead, we evaluate viewpoint estimation based on the inferred part configurations.

In the simulation, we place 3 cars in a 3×3 grid with three considerations: (i) it can generate different occlusion configurations for the car in the center under different camera viewpoints, as well as different multi-car contextual patterns (2-car or 3-car patterns), which is easier than processing simulated data with only 2 cars; (ii) it can generate a synthetic dataset in which the occlusion configurations and multi-car contextual patterns are generic enough to cover the four situations in Fig. 1; and (iii) it can also reduce the gap between the synthetic data and real data when learning the initial appearance parameters for parts, with a car in the back instead of the white background (see more details in Sec. 5).

¹We used 40 CAD models selected from www.doschdesign.com and Google 3D Warehouse.
²www.sketchup.com

1.2.2 The And-Or Model

There are three types of nodes in the And-Or model: an And-node represents a decomposition (e.g., a car is composed of a small number of parts), an Or-node represents alternative ways of decomposition accounting for structural variations (e.g., different part configurations of a single car due to occlusions), and a Terminal-node captures appearance variations to ground a car or a part to image data.
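To make the three node types concrete, the following is a minimal sketch of such a hierarchy in Python; the class names and the toy part list are our own illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Base node of the And-Or graph (a DAG: children may be shared)."""
    name: str
    children: List["Node"] = field(default_factory=list)

class AndNode(Node):
    """Decomposition: all children are instantiated (e.g., car -> parts)."""

class OrNode(Node):
    """Alternatives: exactly one child is selected when parsing."""

class TerminalNode(Node):
    """Grounds a car or part to image data via an appearance template."""

# A toy single-car fragment with two occlusion configurations that share
# parts (part sharing is what makes the graph a DAG rather than a tree).
roof, door, wheel = (TerminalNode(n) for n in ("roof", "door", "rear-wheel"))
full = AndNode("unoccluded", children=[roof, door, wheel])
occluded = AndNode("rear-occluded", children=[roof, door])   # wheel hidden
car = OrNode("single-car", children=[full, occluded])
```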
Fig. 3 illustrates the learned And-Or model. The hierarchy consists of a layer of multi-car contextual patterns (top) and several layers of occlusion configurations of single cars (bottom). The overall structure is as follows:

i) The root Or-node represents different multi-car configurations which capture both viewpoints and car-to-car contextual patterns. Each multi-car contextual pattern is then represented by an And-node (e.g., the car pairs and car triples shown in the figure). The contextual information reflects the layout regularities of a small number, N (e.g., N ∈ {2, 3}), of cars in real situations (such as cars in a parking lot).

[Figure 3: Layer 0, the root Or-node over multi-car configuration branches; Layer 2, single car branches (1st car, 2nd car, 3rd car); Layer 3 and below, consistently visible parts and optional part clusters; one parse tree is reconfigured on-the-fly within the same viewpoint. Legend: And-node, Or-node, Terminal-node.]
Fig. 3. Illustration of our And-Or model for car detection. It represents multi-car contextual patterns and occlusion configurations jointly by modeling spatially-aligned multi-cars together and composing visible parts explicitly for single cars. (Best viewed in color.)

ii) A multi-car And-node is decomposed into nodes representing single cars. Each single car is represented by an Or-node (e.g., the 1st car and the 2nd car), since we have different combinations of car types, viewpoints and occlusion configurations. Here, a multi-car And-node embeds the reconfigurable compositional grammar of a multi-car configuration (e.g., the three 2-car configurations in the right-top of Fig. 2) in which the single cars are reconfigurable w.r.t. viewpoint, occlusion configuration (up to some extent), and car type. This reconfigurability gives our model the expressive power to handle the large variations of multi-car configurations in real situations.

iii) Each occlusion configuration is represented by an And-node which is further decomposed into parts. Parts are learned using CAD simulation (i.e., the 17 semantic parts) and are organized into consistently visible parts and optional part clusters (see the example in the right-bottom of Fig. 3). Then, a single car can be represented by the consistently visible parts (i.e., And) and one of the optional part clusters (i.e., Or). The green dashed bounding boxes show some examples corresponding to different occlusion configurations (i.e., visible parts) from the same viewpoint.

1.2.3 Weakly-supervised Learning of the And-Or Model

Using weakly-annotated real image training data and the synthetic data, we learn the And-Or model in two stages:

i) Learning the structure of the hierarchical And-Or model. Both the multi-car contextual patterns and the occlusion configurations of single cars are learned automatically based on the annotated single car bounding boxes in training data, together with the synthetic examples generated from CAD simulations. The multi-car contextual patterns are mined or clustered from the geometric layout features. The occlusion configurations are learned by a clustering method using the part visibility data matrix. The learned structure is a directed and acyclic graph, since we have both single-car-sharing and part-sharing; thus Dynamic Programming (DP) can be applied in inference.

ii) Learning the parameters for appearance, deformation and bias. Given the learned structure of the And-Or model, we jointly train the parameters in the structural SVM framework and adopt the Weak-Label Structural SVM (WLSSVM) method [15], [16] in implementation.

1.2.4 Experiments

In experiments, we evaluate the detection performance of our model on four car datasets: the KITTI dataset [1], the PASCAL VOC2007 car dataset [2] and two self-collected datasets – the Street-Parking dataset [6] and the Parking Lot dataset [7] (which are released with this paper). Our model outperforms different state-of-the-art variants of DPM [17] (including the latest implementation [18]) on all four datasets, as well as other state-of-the-art models [6], [14], [19], [20] on the KITTI and the Street-Parking datasets. We evaluate viewpoint estimation performance on three car datasets: the PASCAL VOC2006 car dataset [2], the 3D car dataset [3], and the PASCAL3D+ car dataset [4]. Our model achieves performance comparable with the state-of-the-art methods (significantly better than the method using deep learning features [21]). The detection code and data are available on the authors' homepage³.

³http://www.stat.ucla.edu/~tfwu/projects.htm
Paper Organization. The remainder of this paper is organized as follows. Section 2 overviews the related work and summarizes our contributions. Section 3 presents the And-Or model and defines its scoring functions. Section 4 presents the method of mining multi-car contextual patterns and occlusion configurations of single cars in weakly-labeled training data. Section 5 discusses the learning of model parameters using WLSSVM, as well as details of the DP inference algorithm. Section 6 presents the experimental results and comparisons of the proposed model on the four car detection datasets and the three viewpoint estimation datasets. Section 7 concludes the paper with discussions.

2 RELATED WORK AND OUR CONTRIBUTIONS

Over the last decade, object detection has made much progress in various vision tasks such as face detection [22], pedestrian detection [23], and generic object detection [2], [17], [24]. In this section we focus on occlusion and context modeling in object detection, and classify the recent literature into three research streams. For a full review of contemporary approaches, we refer the reader to recent survey articles [25], [26], [27].

i) Single Object Modeling and Occlusion Modeling. Hierarchical models are widely used in the recent literature on object detection, and most existing approaches are devoted to learning a single object model. Many works extended the deformable part-based model (DPM) [17] (which has a two-layer structure) by exploring deeper hierarchies and global part configurations [15], [24], [28], using strong manually-annotated parts [29] or CAD models [30], or keeping a human in-the-loop [31]. To address the occlusion problem, various occlusion models estimate the visibilities of parts from image appearance, using the assumption that the visibility of a part is (a) independent of other parts [32], [33], [34], [35], [36], (b) consistent with neighboring parts [15], [37], or (c) consistent with its parent or child parts describing object appearance at different scales [38]. Another essential problem is how to organize part configurations. Recently, [6], [15], [34] explored different ways to deal with this problem. In particular, [34] modeled different part configurations by local part mixtures. [15] used a more flexible grammar model to infer both the occluder and the visible parts of an occluded person. [6] regularized parts into consistently visible parts and optional part clusters, which is more efficient for representing occlusion configurations. Recent work [39], [40], [41], [42], [43] proposed to enumerate possible occlusion configurations and model each occlusion configuration as a specific component. [44] proposed a 2D model to learn discriminative subcategories, and [45] further integrated it with an explicit 3D occlusion model, both showing excellent performance on the KITTI dataset. Though those models were successful in some heavily occluded cases, they did not represent contextual information, and usually learned a separate context model using the detection scores as input features. Recently, an And-Or quantization method was proposed to learn And-Or tree models [24], [46] for generic object detection in PASCAL VOC [2] and to learn 3D And-Or models [47], respectively, which could be useful in occlusion modeling.

ii) Object-Pair and Visual Phrase Models. To account for strong co-occurrence, object-pair [11], [12], [13], [14] and visual phrase [10] methods modeled occlusions and interactions using an X-to-X or X-to-Y composite template that spans both one object (i.e., "X" such as a person or a car) and another interacting object (i.e., "X" or "Y" such as the other car in a car pair in parking lots, or a bicycle on which a person is riding). Although these models can handle occlusion better than single object models, the object-pair and visual phrase models represent occlusion only implicitly, and they were often manually designed with fixed structures (i.e., not reconfigurable in inference). They performed worse than the original DPM on the KITTI dataset, as evaluated by [14].

iii) Context Models. Many context models have been exploited in object detection with improved performance [48], [49], [50], [51], [52]. Hoiem et al. [50] explored scene context, and Desai et al. [49] improved object detectors by incorporating multi-class context on the PASCAL dataset [2] in a max-margin framework.
In [51], Tu and Bai integrated the detector responses with background pixels to determine the foreground pixels. In [52], Chen et al. proposed a multi-order context representation to take advantage of the co-occurrence of different objects. Recently, [53] explored geographic contextual information to facilitate car detection, and [54] explored a 3D panoramic context in object detection. Although these works verified that context is crucial in object detection, most of them modeled objects and context separately, not in a unified framework.

This paper extends our two previous conference papers [6], [7] in the following aspects: (i) a unified representation is learned for integrating occlusion and context; (ii) more details on the learning algorithm and the detection algorithm are presented; (iii) more analyses and comparisons of the experimental results are added, with improved performance.

This paper makes three contributions to the literature on car detection:

i) It proposes an And-Or model to represent multi-car context and occlusion configurations. The proposed model is multi-scale and reconfigurable to account for large structure, viewpoint and occlusion variations.

ii) It presents a simple, yet effective, approach to mine context and occlusion configurations from weakly-labeled training data.

iii) It introduces two datasets for evaluating occlusion and multi-car context, and obtains performance comparable to or better than state-of-the-art car detection methods on four challenging datasets.

3 REPRESENTATION AND INFERENCE

3.1 The And-Or Model and Scoring Functions

In this section, we introduce the notation for defining the And-Or model and its scoring functions.
An And-Or model is defined by a 3-tuple, G = (V, E, \Theta), where V = V_{And} \cup V_{Or} \cup V_T represents the nodes in three subsets: And-nodes V_{And}, Or-nodes V_{Or} and Terminal-nodes V_T; E is the set of edges organizing all the nodes in a directed and acyclic graph (DAG); and \Theta = (\Theta^{app}, \Theta^{def}, \Theta^{bias}) is the set of parameters (for appearance, deformation and bias respectively, to be defined later).

A Parse Tree is an instantiation of the And-Or model obtained by selecting the best child (according to the scoring functions to be defined) for each encountered Or-node. The green arrows in Fig. 3 show an example of a parse tree.

Appearance Features. We adopt the Histogram of Oriented Gradients (HOG) features [17], [55] to describe appearance. Let I be an image defined on an image lattice. Denote by H the HOG feature pyramid computed for I using \lambda levels per octave, and by \Lambda the lattice of the whole pyramid. Let p = (l, x, y) \in \Lambda specify a position (x, y) in the l-th level of the pyramid H. Denote by \Phi^{app}(H, p_t) the extracted HOG features for a Terminal-node t placed at position p_t in the pyramid.

Deformation Features. We allow local deformation when composing the child nodes into a parent node. In our model, parts are placed at twice the spatial resolution w.r.t. single cars, while single cars and composite multi-cars are at the same spatial resolution. We penalize the displacements between the anchor locations of child nodes (w.r.t. the placed parent node) and their actual deformed locations. Denote by \delta = [dx, dy] the displacement. The deformation feature is defined by

  \Phi^{def}(\delta) = [dx^2, dx, dy^2, dy]'.

A Terminal-node t \in V_T grounds a single car or a part to image data (see Layers 3 and 4 in Fig. 3). Given a parent node A, the model for t is defined by a 4-tuple

  (\theta^{app}_t, s_t, a_{t|A}, \theta^{def}_{t|A}),

where \theta^{app}_t \subset \Theta^{app} is the appearance template, s_t \in \{0, 1\} is the scale factor for placing node t w.r.t. its parent node, a_{t|A} is a two-dimensional vector specifying an anchor position relative to the position of the parent node A, and \theta^{def}_{t|A} \subset \Theta^{def} are the deformation parameters. Given the position p_A = (l_A, x_A, y_A) of the parent node A, the scoring function of a Terminal-node t is defined by

  score(t|A, p_A) = \max_{\delta \in \Delta} ( \langle \theta^{app}_t, \Phi^{app}(H, p_t) \rangle - \langle \theta^{def}_{t|A}, \Phi^{def}(\delta) \rangle ),   (1)

where \Delta is the space of deformation (i.e., the lattice of the corresponding level in the feature pyramid), and p_t = (l_t, x_t, y_t) with l_t = l_A - s_t \lambda and (x_t, y_t) = 2^{s_t}(x_A, y_A) + a_{t|A} + \delta. Here s_t = 0 means the object and parts are placed at the same resolution, s_t = 1 means parts are placed at twice the resolution of the object templates, and \langle \cdot, \cdot \rangle denotes the inner product. Fig. 3 shows some learned appearance templates.

An And-node A \in V_{And} represents the decomposition of a large entity (e.g., a multi-car layout at Layer 1 or a single car at Layer 3 in Fig. 3) into its constituents (e.g., 2 or 3 single cars, or a small number of parts). Single car And-nodes are associated with viewpoints. Unlike the Terminal-nodes, single car And-nodes are not allowed to deform within a multi-car configuration in this paper (we implemented this in experiments and did not observe performance improvement, so for simplicity we make them non-deformable). Denote by ch(v) the set of child nodes of a node v \in V_{And} \cup V_{Or}. The position p_A of an And-node A is inherited from its parent Or-node, and the scoring function is then defined by

  score(A, p_A) = \sum_{v \in ch(A)} score(v|A, p_A) + b_A,   (2)

where b_A \in \Theta^{bias} is the bias term. Each single car And-node (at Layer 3) can be treated as the DPM [17] or the And-Or structure proposed in [6]. So, our model is flexible enough to integrate state-of-the-art single object models. For multi-car And-nodes (at Layer 1), the child nodes are Or-nodes, and the scoring function score(v|A, p_A) is defined below.

An Or-node O \in V_{Or} represents different structural variations (e.g., the root node and the i-th car node at Layer 2 in Fig. 3). For the root Or-node O placed at position p \in \Lambda, the scoring function is defined by

  score(O, p) = \max_{v \in ch(O)} score(v, p),   (3)

where ch(O) \subset V_{And}. For the i-th car Or-node O, given a parent multi-car And-node A placed at p_A, the scoring function is then defined by

  score(O|A, p_A) = \max_{v \in ch(O)} \max_{\delta \in \Delta} ( score(v, p_v) - \langle \theta^{def}_{O|A}, \Phi^{def}(\delta) \rangle ),   (4)

where p_v = (l_v, x_v, y_v) with l_v = l_A and (x_v, y_v) = (x_A, y_A) + \delta. The best child of an Or-node is computed by taking the argmax of Eqn. (3) and Eqn. (4).
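Read operationally, Eqns. (1)-(4) define a simple recursion over the graph. The sketch below renders that recursion at a single pyramid position; it is only a schematic (the dense score maps and the generalized distance transform of Sec. 3.2 are omitted), and `feat_at`, a callable returning the HOG feature vector at a location, is our own placeholder.

```python
import numpy as np

def score_terminal(theta_app, theta_def, feat_at, anchor, deform_space):
    """Eqn. (1): maximize appearance score minus deformation cost over
    the displacement space Delta."""
    best = -np.inf
    for dx, dy in deform_space:
        phi_def = np.array([dx * dx, dx, dy * dy, dy], dtype=float)
        s = feat_at(anchor[0] + dx, anchor[1] + dy) @ theta_app \
            - phi_def @ theta_def
        best = max(best, s)
    return best

def score_and(child_scores, bias):
    """Eqn. (2): an And-node sums its children's scores plus a bias."""
    return sum(child_scores) + bias

def score_or(child_scores):
    """Eqns. (3)/(4): an Or-node takes its best-scoring child; taking the
    argmax instead of the max yields the selected branch of the parse tree."""
    return max(child_scores)
```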
3.2 The DP Algorithm in Detection

In detection, we place the And-Or model at all positions p \in \Lambda and retrieve the optimal parse trees for all positions at which the scores are greater than the detection threshold. Thanks to the directed and acyclic structure of our And-Or model, we can utilize the efficient DP algorithm, which consists of two stages.

In the bottom-up pass: following the depth-first-search (DFS) order of nodes in the And-Or model, the bottom-up pass computes the matching scores of all possible parse trees of the And-Or model at all possible positions in the whole feature pyramid. First of all, we compute the appearance score maps (pyramid) for all Terminal-nodes (which is done by filter convolution). The optimal position of a Terminal-node w.r.t. a parent node can be computed as a function of the position of the parent node. The quality (matching score) of the optimal position for a Terminal-node w.r.t. a given position of the parent is computed using Eqn. (1) (which yields the deformed score map through the generalized distance transform trick, as done in the DPM [17] for efficiency), and the optimal position can be retrieved by replacing max in Eqn. (1) with argmax. Then, following the DFS order of nodes, we compute the score maps of the And-nodes and Or-nodes by applying Eqn. (2), (3) and (4), where the score maps of their child nodes have already been computed. Similarly, the optimal branch of each Or-node is obtained by replacing max in Eqn. (3) and (4) with argmax.

In the top-down pass, we first find all detection candidates for the root Or-node O based on its score maps, i.e., the positions P = \{p ; score(O, p) \ge \tau and p \in \Lambda\}. Then, following the breadth-first-search (BFS) order of nodes, we retrieve the optimal parse tree at each p \in P: starting from the root Or-node, we select the optimal branch of each encountered Or-node, keep all the child sub-trees of each encountered And-node, and retrieve the optimal positions of the Terminal-nodes. Based on the selected single car And-nodes and the positions of the Terminal-nodes, we obtain the viewpoint estimation and the occlusion configuration.

Post-processing. To generate the final detection results of single cars for evaluation, we apply multi-car guided non-maximum suppression (NMS) to deal with occlusions:

i) Some of the single cars in a multi-car detection candidate are highly overlapped due to occlusion, so if we directly use conventional NMS, we will miss the detection of the occluded cars. We enforce that the single car bounding boxes in a multi-car prediction are not suppressed by each other. A similar idea is also used in [12].

ii) Overlapped multi-car detection candidates might report multiple predictions for the same single car. For example, if a car is shared by a 2-car detection candidate and a 3-car detection candidate, it will be reported twice. We keep only the one with the higher score.
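A minimal sketch of the multi-car guided NMS described above, assuming each detection candidate carries a score and the list of single-car boxes it predicts; the function and variable names are ours, not the authors' code.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def multicar_guided_nms(candidates, thresh=0.5):
    """candidates: list of (score, [box, ...]) multi-car predictions,
    where each box is a single-car bounding box."""
    kept = []
    for _, boxes in sorted(candidates, key=lambda c: -c[0]):
        # Rule i): boxes from the same candidate never suppress each other,
        # so heavily overlapping occluded cars survive. Rule ii): a car
        # already reported by a higher-scoring candidate is dropped.
        fresh = [b for b in boxes if all(iou(b, k) < thresh for k in kept)]
        kept.extend(fresh)
    return kept
```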
4 LEARNING AND-OR STRUCTURES

In this section, we present the methods of learning the structure of the And-Or model by mining contextual patterns and occlusion configurations in the positive training dataset.

4.1 Generating Multi-car Training Samples

Positive Samples. Denote by D^+ = \{(I_1, B_1), \cdots, (I_n, B_n)\} the positive training dataset, with B_i = \{B_i^j = (x_i^j, y_i^j, w_i^j, h_i^j)\}_{j=1}^{k_i} being the set of k_i annotated single car bounding boxes in image I_i. Here, (x, y) is the left-top corner and (w, h) the width and height.

Denote the set of N-car positive samples by

  D^+_{N\text{-}car} = \{ (I_i, B_i^J) ; |J| = N, B_i^J \subseteq B_i, i \in [1, n] \},   (5)

where all the I_i's have no fewer than N annotated single cars (i.e., k_i \ge N). We have:

i) D^+_{1\text{-}car} consists of all the single car bounding boxes which do not overlap any other ones in the same image. For N \ge 2, D^+_{N\text{-}car} is generated iteratively.

ii) In generating D^+_{2\text{-}car} (see Fig. 4 (a)), for each positive image (I_i, B_i) \in D^+ with k_i \ge 2, we enumerate all valid 2-car configurations: starting from a bounding box B_i^j (1 \le j \le k_i), we first collect the set of its surrounding car bounding boxes N_{B_i^j} \subseteq B_i which overlap B_i^j, and then select the second car B_i^k \in N_{B_i^j} which has the largest overlap, provided N_{B_i^j} \ne \emptyset and (I_i, B_i^J) \notin D^+_{2\text{-}car} (J = \{j, k\}).

iii) In generating D^+_{N\text{-}car} (N > 2, see Fig. 4 (b)), for each positive image with k_i \ge N and \exists (I_i, B_i^K) \in D^+_{(N-1)\text{-}car}, we first select the current B_i^K as the seed and obtain the neighbors N_{B_i^K}, each of which overlaps at least one bounding box in B_i^K; we then select the bounding box B_i^j \in N_{B_i^K} which has the largest overlap, and add (I_i, B_i^J) to D^+_{N\text{-}car} (J = K \cup \{j\}).

[Figure 4: (a) a 2-car sample; (b) a 3-car sample.]
Fig. 4. Illustration of generating multi-car positive samples.

Negative Samples. We collect negative samples from images without cars provided in the benchmark datasets, and apply the hard negative mining approach during parameter learning, as done in the DPM [17].

4.2 Mining Multi-car Contextual Patterns

This section presents the method of learning the multi-car patterns in Layers 0-2 in Fig. 3. Considering N \ge 2, we use the relative positions of single cars to describe the layout of a multi-car sample (I_i, B_i^J) \in D^+_{N\text{-}car}. Denote by (cx, cy) the center of a car bounding box (J = \{1, \cdots, N\}). Let w_J and h_J be the width and height of the union bounding box of B_i^J, respectively. With the center of the first car being the centroid, we define the layout feature by

  [ (cx_i^2 - cx_i^1)/w_J, (cy_i^2 - cy_i^1)/h_J, \cdots, (cx_i^N - cx_i^1)/w_J, (cy_i^N - cy_i^1)/h_J ].   (6)

We cluster these layout features over D^+_{N\text{-}car} into T clusters using k-means. The obtained clusters are used to specify the And-nodes at Layer 1 in Fig. 3. The number of clusters T is specified empirically for the different training datasets in our experiments.

In Fig. 5 (top), we visualize the clustering results for D^+_{2\text{-}car} on the KITTI [1] and the Parking Lot datasets. Each set of colored points represents a 2-car context pattern. In the KITTI dataset, we can observe some car-to-car "peak" modes (similar to the analyses in [14]), while the context patterns are more diverse in the Parking Lot dataset.

[Figure 5]
Fig. 5. Left-top: 2-car context patterns on the KITTI dataset [1] and the self-collected Parking Lot dataset. Each context pattern is represented by a specific color set, and each circle stands for the center of a cluster. Left-bottom: overlap ratio histograms of the KITTI dataset and the Parking Lot dataset (we show the occluded cases only). Right: some cropped examples with different occlusions. The two bounding boxes in a car pair are shown in red and blue, respectively. (Best viewed in color.)
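The layout feature of Eqn. (6) and the clustering step admit a direct implementation. The sketch below uses scikit-learn's KMeans for illustration (the paper specifies k-means but no particular library), with boxes given as (x, y, w, h) as defined in Sec. 4.1.

```python
import numpy as np
from sklearn.cluster import KMeans

def layout_feature(boxes):
    """Eqn. (6): centers of cars 2..N relative to car 1, normalized by the
    width/height of the union bounding box; boxes are rows of (x, y, w, h)."""
    boxes = np.asarray(boxes, dtype=float)
    cx = boxes[:, 0] + boxes[:, 2] / 2.0
    cy = boxes[:, 1] + boxes[:, 3] / 2.0
    w_union = (boxes[:, 0] + boxes[:, 2]).max() - boxes[:, 0].min()
    h_union = (boxes[:, 1] + boxes[:, 3]).max() - boxes[:, 1].min()
    feat = [((cx[i] - cx[0]) / w_union, (cy[i] - cy[0]) / h_union)
            for i in range(1, len(boxes))]
    return np.asarray(feat).ravel()          # length 2(N - 1)

def mine_context_patterns(samples, T):
    """Cluster the N-car layout features into T contextual patterns, which
    specify the multi-car And-nodes at Layer 1."""
    X = np.vstack([layout_feature(s) for s in samples])
    return KMeans(n_clusters=T, n_init=10).fit(X)
```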
4.3 Mining Occlusion Configurations

In this section we present the method of learning the occlusion configurations for single cars in Layers 3 and 4 in Fig. 3. We learn the occlusion configurations automatically from a large number of occlusion configurations generated by CAD simulations. Note that the synthetic data are used to learn the occlusion configurations, while the appearance and geometry parameters are still learned from real data.

4.3.1 Generating Occlusion Configurations

As mentioned in Sec. 1.2.1, we choose to place 3 cars on a 3×3 grid in generating occlusion configurations. Specifically, we choose the center and 2 other randomly selected positions on the grid, and put cars around these grid points to simulate occlusions. See some examples in Fig. 2.

The occlusion configurations reflect the four factors: car type t, orientation \rho, relative position r and camera view \Pi. To generate an occlusion configuration, we randomly assign values to these factors, where for each car with type i, \rho_i \in \{frontal, rear\} and r_i = r_i(0) + dr, where r_i(0) is the nominal position for the i-th car on the 3×3 grid and dr = (dx, dy) is the relative distance (along the x-axis and y-axis) between the sampled position and the nominal position of the i-th car. The camera view is in the range of azimuth \in [0, 2\pi] and elevation \in [0, \pi/4], and we discretize the view space into B view bins uniformly along the azimuth angle. In the synthesized configurations, a part is treated as occluded if 60% of its area is not visible.

4.3.2 Constructing the Initial And-Or Model of Single Cars

With the part-level visibility information, we compute two vectors for each occlusion configuration: the first is a (17 parts × B camera views)-dimensional binary-valued vector v for the visibilities of parts, and the second is a real-valued ((1 root + 17 parts) × B camera views × 4)-dimensional vector b for the bounding boxes of the root and parts. In both vectors, entries corresponding to invisible parts are set to 0.

Denoting by M the dimension of the vector v, and stacking v for N occlusion configurations, we get an N × M occlusion matrix D; the first few rows of this matrix for B = 8 are shown in the right side of Fig. 6. Note that we have partitioned the view space into B views, so for each row, the visible parts always concentrate in a segment of the vector representing that view.

In learning an initial And-Or model, each row in D corresponds to a small subtree of the root Or-node. In particular, each subtree consists of an And-node as the root and a set of Terminal-nodes as its children. An example of the data matrix and the corresponding initial And-Or model is shown in the middle of Fig. 6.

4.3.3 Refining the And-Or Structure

The initial And-Or model is large and redundant, since it has many duplicated occlusion configurations (i.e., duplicated rows in D) and a combinatorial number of part compositions. In the following, we pursue a compact And-Or structure. The problem can be formulated as

  \min_G \sum_{i=1}^{N} \| v_i - v_i(G) \|_2^2 + \lambda |G|,   (7)

where v_i is the i-th row of the data matrix D, v_i(G) returns the closest occlusion configuration generated by the And-Or graph (AOG), |G| is the number of nodes and edges in the structure, and \lambda is the trade-off parameter balancing model precision and complexity. In each view, we assume the number of occlusion branches is not greater than K (= 4).

We solve Eqn. (7) using a modified graph compression algorithm similar to [56]. As illustrated in the right side of Fig. 6, the algorithm starts from the initial And-Or model and iteratively combines branches if the introduced loss is smaller than the decrement in the complexity term \lambda |G|. This process is equivalent to iteratively finding large blocks of 1s in the corresponding data matrix through row and column permutations; an example is shown at the bottom of Fig. 6. As there are consistently visible parts for each view, the algorithm quickly converges to the structure shown in Fig. 3.

With the refined And-Or model, we compute the occlusion configurations (i.e., the consistently visible parts and optional occluded parts) in each view. In addition, the bounding box size and nominal position of each Terminal-node w.r.t. its parent And-node can be estimated by geometric means of the corresponding values in the vector b. This information will be used to initialize the latent variables of our model in learning the parameters.
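The refinement is driven by finding large all-ones blocks in the binary visibility matrix D. Below is a greedy, much-simplified stand-in for that block-finding step (the actual algorithm is the graph compression of [56]); it grows a column set shared by many rows, which plays the role of a set of consistently visible parts.

```python
import numpy as np

def largest_ones_block(D, min_rows=2):
    """Greedily grow a column set that is all-ones over many rows of the
    binary visibility matrix D: the surviving columns act as consistently
    visible parts, and the surviving rows form one occlusion branch."""
    D = np.asarray(D)
    rows = np.ones(D.shape[0], dtype=bool)
    cols = []
    for c in np.argsort(-D.sum(axis=0)):        # try densest columns first
        keep = rows & (D[:, c] == 1)
        if keep.sum() >= min_rows:              # enough examples still agree
            rows, cols = keep, cols + [int(c)]
    return np.flatnonzero(rows), sorted(cols)

# Toy matrix: 4 simulated occlusion configurations over 6 parts.
D = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 1, 1],
              [1, 1, 1, 1, 0, 0],
              [1, 1, 0, 1, 0, 0]])
print(largest_ones_block(D))   # rows [0 1] share visible parts [0, 1, 2, 5]
```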
[Figure 6: the occlusion data matrix (rows: simulated examples 1..N; columns: visibility of parts A-H), the initial AOG, and the refined AOG.]
Fig. 6. Illustration of learning occlusion configurations. It consists of three components: (i) generating occlusion configurations using CAD simulations with 17 semantic parts in total; (ii) learning the initial And-Or structure based on the data matrix constructed from the simulated occlusion configurations, where each row of the data matrix represents an example and the columns represent the visibility of the 17 semantic parts (a white/gray entry denotes that a part is visible/invisible), and each example is represented by an And-node as one child of the root Or-node; (iii) refining the initial And-Or structure using the graph compression algorithm [56] to seek the consistently visible parts (e.g., X) and optional part clusters (e.g., Y and Z).

Variants of And-Or Models. We will test our model using two types of specifications, to be consistent with our two previous conference papers: one is called And-Or Structure [6], for occlusion modeling based on CAD simulation without multi-car context components, and the other is called Hierarchical And-Or Model [7], for occlusion and context. We also compare two methods of part selection in the hierarchical And-Or model: one is based on greedy part selection as done in the DPM [17], denoted by AOG+Greedy, and the other is based on the proposed CAD simulation, denoted by AOG+CAD.

5 LEARNING PARAMETERS

With the learned And-Or structure, we adopt the WLSSVM method [15] to learn the parameters \Theta = (\Theta^{app}, \Theta^{def}, \Theta^{bias}) (for appearance, deformation and bias). When the occlusion configurations are mined by CAD simulations (i.e., for the two model specifications And-Or Structure and AOG+CAD), we use both Step 0 and Step 1 below in learning parameters; otherwise we use Step 1 only (i.e., for AOG+Greedy).

Step 0: Initializing Parameters with Synthetic Training Data. We learn the initial parameters \Theta with synthetic training data (see Fig. 10). We randomly superimpose the synthetic positive samples on randomly selected real images without cars (instead of using the white background directly, see Fig. 10) to reduce the appearance gap between the synthetic samples and real car samples. In the synthetic data, the parse tree pt for each multi-car positive sample is known, except that the positions of parts are allowed to deform.

Step 1: Learning Parameters with Real Training Data. In the real training data, we only have annotated bounding boxes for single cars. The parse tree pt for each multi-car positive sample is hidden, except for the multi-car configuration, which can be computed based on the annotated bounding boxes of single cars as stated in Sec. 4.2. Then, we initialize the parse tree for each positive sample based on the initial parameters learned in Step 0 (for the And-Or Structure and AOG+CAD), or use a similar idea as done in learning the mixture of DPMs [17] to initialize the single-car And-nodes for AOG+Greedy. After the initialization, the parameters \Theta are learned iteratively under the WLSSVM framework. During learning, we run the DP inference to assign the optimal parse trees for multi-car positive samples.
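The iterative learning just described can be summarized as an alternation between latent completion and a convex update. The loop below is a schematic sketch only; `best_parse`, `detect` and `fit_convex` are hypothetical placeholders for the DP inference and the structural-SVM solver, not the API of [18].

```python
def train_wlssvm(model, positives, negatives, n_rounds=5):
    """Schematic alternation: (1) complete the latent parse trees of the
    weakly-labeled positives with the current model via DP inference,
    (2) mine hard negatives, (3) solve the resulting convex problem."""
    for _ in range(n_rounds):
        # Latent completion: pick the best parse tree consistent with the
        # annotated single-car boxes of each multi-car positive sample.
        latent = [model.best_parse(image, boxes) for image, boxes in positives]
        # Hard-negative mining keeps the negative set tractable.
        hard = [det for image in negatives
                for det in model.detect(image) if det.score > -1.0]
        # Convex update of Theta (appearance, deformation, bias) with the
        # latent assignments held fixed.
        model.fit_convex(latent, hard)
    return model
```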
The objective function to be minimized is defined by

  E(\Theta) = \frac{1}{2} \|\Theta\|^2 + C \sum_{i=1}^{M} L'(\Theta, x_i, y_i),   (8)

where x_i \in D^+_{N\text{-}car} represents a training sample (N \ge 1) and y_i is the set of N bounding box(es). L'(\Theta, x, y) is the surrogate loss function,

  L'(\Theta, x, y) = \max_{pt \in \Omega_G} [ score(x, pt; \Theta) + L_{margin}(y, box(pt)) ] - \max_{pt \in \Omega_G} [ score(x, pt; \Theta) - L_{output}(y, box(pt)) ],   (9)

where \Omega_G is the space of all parse trees derived from the And-Or model G, score(x, pt; \Theta) computes the score of a parse tree as stated in Sec. 3, and box(pt) is the predicted bounding box(es) based on the parse tree. As pointed out in [15], the loss L_{margin}(y, box(pt)) encourages high-loss outputs to "pop out" of the first term on the right-hand side, so that their scores get pushed down. The loss L_{output}(y, box(pt)) suppresses high-loss outputs in the second term on the right-hand side, so that the score of a low-loss prediction gets pulled up. More details can be found in [15], [16]. In general, since L' in Eqn. (9) is not convex, the objective function in Eqn. (8) leads to a nonconvex optimization problem. The WLSSVM adopts the CCCP procedure [57] in optimization, which can find a local optimum of the objective. The loss function is defined by

  L_{\ell,\tau}(y, box(pt)) =
    \ell, if y = \bot and pt \ne \bot;
    \ell, if y \ne \bot and pt = \bot;
    \ell, if y \ne \bot and \exists B \in y with ov(B, B') < \tau, \forall B' \in box(pt);
    0, if y = \bot and pt = \bot;
    0, if y \ne \bot and \forall B \in y, \exists B' \in box(pt) with ov(B, B') \ge \tau,   (10)

where \bot represents the background output and ov(\cdot, \cdot) is the intersection-over-union ratio of two bounding boxes. Following the PASCAL VOC protocol, we use L_{margin} = L_{1,0.5} and L_{output} = L_{\infty,0.7}. In practice, we modify the implementation in [18] for our loss formulation.
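Eqn. (10) translates directly into code; the sketch below implements L_{\ell,\tau} together with the overlap test ov(\cdot, \cdot), with the background output \bot represented by None (the function names are ours).

```python
def ov(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def weak_label_loss(y, pred, l=1.0, tau=0.5):
    """Eqn. (10): y is the ground-truth box set, or None for background;
    pred is box(pt), or None when the parse tree predicts background."""
    if y is None:
        return 0.0 if pred is None else l        # false alarm on background
    if pred is None:
        return l                                  # missed the object(s)
    for B in y:                                   # every ground-truth box...
        if all(ov(B, Bp) < tau for Bp in pred):   # ...must be covered
            return l
    return 0.0

# L_margin = L_{1, 0.5}; L_output = L_{inf, 0.7} (a prohibitively large l).
L_margin = lambda y, p: weak_label_loss(y, p, l=1.0, tau=0.5)
L_output = lambda y, p: weak_label_loss(y, p, l=float("inf"), tau=0.7)
```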
[Figure 7: histograms; x-axes: overlap ratio (intersection/union) and cars per image. The Street-Parking dataset has 67.52% occluded cars and 7.04 cars per image on average.]
Fig. 7. Top: the distribution of overlap ratio and cars per image on the Street-Parking dataset. Bottom: comparison of the average number of cars per image.

             PASCAL [2]   KITTI [1]   Street-Parking
  Avg. cars  1.75         ≈3          7.04

[Figure 8]
Fig. 8. Precision-recall curves on the test subset split from the KITTI trainset (left) and the Parking Lot dataset (right).

6 EXPERIMENTS

In this section, we evaluate our models on four car detection datasets and three car viewpoint estimation datasets, and present detailed analyses of different aspects of our models. We first introduce two self-collected car datasets of street-parking cars and parking-lot cars, respectively (Sec. 6.1), and then evaluate the detection performance of our models on four datasets (Sec. 6.2): the two self-collected datasets, the KITTI car dataset [1] and the PASCAL VOC2007 car dataset [2]. We further analyze the performance of our model w.r.t. its different aspects (Sec. 6.3). The performance of car viewpoint estimation is presented in Sec. 6.4.

Training and Testing Time. In all experiments, we utilize a parallel computing technique to train our model. It takes about 9 hours to train an And-Or Structure model and 16 hours to train a hierarchical And-Or model, due to inferring the assignments of part latent variables on positive training examples and mining hard negatives. For detection, it takes about 2 and 3 seconds to process an image of size 640×480 pixels for an And-Or Structure and a hierarchical And-Or model, respectively.

6.1 Datasets

To test our model on occlusion and context modeling, we collected two car datasets⁴.

⁴http://www.stat.ucla.edu/~boli/publication/street-parking-release.zip and parking_lot_release.zip

The Street Parking Car Dataset. There are several datasets featuring a large number of car images [2], [3], [58], [59], but they are not suitable for evaluating occlusion handling, as the proportion of (moderately or heavily) occluded cars is marginal. The recently proposed KITTI dataset [1] contains occluded cars parked along the streets, but it cannot fully evaluate the ability of our model, since the car views are rather fixed, as the video sequences are captured from a car driving on the road (e.g., no bird's-eye view). In addition, the average number of cars in each image is still not large enough (mostly 3 cars; see the statistics at the bottom of Fig. 7). To provide a more challenging occlusion dataset, we collected one emphasizing street-parking cars with heavy occlusions, diverse viewpoint changes and a much larger number of cars per image (see the last two rows in Fig. 9). The dataset consists of 881 images. Fig. 7 shows the bounding box overlap distribution and the average number of cars per image. For simplicity of annotation, we only label the bounding boxes of single cars in each image. We split the dataset into training and testing sets containing 440 and 441 images, respectively.

The Parking Lot Dataset. Our Street Parking Car Dataset provides more viewpoints; however, its context and occlusion configurations are relatively restricted (most cars compose head-to-head occlusions). To thoroughly evaluate our models in terms of both context and occlusions, we collected the Parking Lot car dataset, which has larger occlusion variations and a larger number of cars in each image (see the 4th and 5th rows in Fig. 9). It contains 65 training images and 63 testing images. Although the number of images is small, the number of cars is noticeably large, with 3,346 cars (including left-right mirrored ones) for training and 2,015 cars for testing.

6.2 Detection

We test our hierarchical And-Or model on four challenging datasets.

6.2.1 Results on the KITTI Dataset

The KITTI dataset [1] contains 7,481 training images and 7,518 testing images, which are captured from an autonomous driving platform. We follow the provided benchmark protocol for evaluation. Since the authors of [1] have not released the test annotations, we test our model in the following two settings.

Training and Testing by Splitting the Trainset. We randomly split the KITTI trainset into training and testing subsets of equal size.

Baseline Methods. Since DPM [17] is a very competitive model with source code publicly available, we compare our model with the latest version of DPM (i.e., voc-release5 [18]). The number of components is set to 16, as for the baseline methods trained in [1]; other parameters are set to their defaults.

Parameter Settings. We consider multi-car contextual patterns with the number of cars N = 1, 2. We set the numbers of context patterns and occlusion configurations to 10 and 16, respectively. As a result, the learned hierarchical And-Or model has 10 2-car configurations in Layer 1 and 16 single car branches in Layer 3 (see Fig. 3).

Detection Results. The left figure in Fig. 8 shows the precision-recall curves of DPM and our model. Our model outperforms DPM by 9.1% in terms of average precision (AP). The performance gain comes from both precision and recall, which shows the importance of context and occlusion modeling.
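For reference, AP here follows the PASCAL VOC style of integrating an interpolated precision-recall curve. The utility below is a generic sketch we supply for illustration (detections are assumed to have already been matched to ground truth at the required overlap); it is not the benchmark's own evaluation code.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """scores: confidences of all detections; is_tp: 1 if a detection was
    matched to an unclaimed ground-truth box, else 0; n_gt: #ground truth."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # Make precision monotonically non-increasing, then integrate over recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(precision * steps))
```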
Testing on the KITTI Benchmark. We evaluate our model with two different training data settings: one trained using half of the training set, denoted by AOG+Greedy-Half, and the other trained with the full training set, denoted by AOG+Greedy-Full (which has 16 context patterns and 32 occlusion configurations).

The benchmark has three subsets (Easy, Moderate, Hard) w.r.t. the difficulty of object size, occlusion and truncation. All methods are ranked based on performance on the moderately difficult subset. Our entry in the benchmark is "AOG". Table 1 shows the detection results of our model and other state-of-the-art models. Here, we omit the CNN-based methods, as they are all anonymous submissions. Details of the benchmark results are available at http://www.cvlibs.net/datasets/kitti/eval_object.php.

  Methods                    Easy      Moderate   Hard
  mBow [19]                  36.02%    23.76%     18.44%
  LSVM-MDPM-us [17]          66.53%    55.42%     41.04%
  LSVM-MDPM-sv [17], [20]    68.02%    56.48%     44.18%
  MDPM-un-BB [17]            71.19%    62.16%     48.43%
  OC-DPM [14]                74.94%    65.95%     53.86%
  DPM [18] (trained by us)   77.24%    56.02%     43.14%
  MV-RGBD-RF [60]            76.40%    69.92%     57.47%
  SubCat [44]                84.14%    75.46%     59.71%
  3DVP [45]                  87.46%    75.77%     65.38%
  Regionlets [61]            84.75%    76.45%     59.70%
  AOG+Greedy-Half            84.36%    71.88%     59.27%
  AOG+Greedy-Full            84.80%    75.94%     60.70%

TABLE 1. Performance comparison (in AP) on the KITTI benchmark [1].

Our AOG+Greedy-Full outperforms all the DPM-based models. Compared with their best model, OC-DPM [14], our model improves performance on the three subsets by 9.86%, 9.99%, and 6.84%, respectively. We also compare with the baseline DPM trained by ourselves using the voc-release5 code [18], and obtain 7.56%, 19.92% and 17.56% performance gains on the three subsets. Compared with the other DPM-based methods trained by the benchmark authors, our model outperforms the best one, MDPM-un-BB, by 13.61%, 13.78% and 12.27%, respectively.

Our model is comparable with SubCat [44], 3DVP [45] and Regionlets [61]. We achieve slightly better performance than Regionlets [61] on the Easy and Hard sets, but lose a bit of AP on the Moderate set. Though our method obtains a better rank than 3DVP [45] on the moderately difficult set, it performs slightly worse on the Easy and Hard subsets, which shows the promise of 3D occlusion modeling and subcategory clustering [44], [45].

Comparing AOG+Greedy-Half and AOG+Greedy-Full, we observe that the major improvement (4.06%) of AOG+Greedy-Full comes from the Moderate set, while on the Easy and Hard sets we obtain small improvements (0.44% and 1.43%, respectively). These results agree with some analyses in [62], which indicate that there is still large potential improvement in object representation, and much effort should be devoted to improving our current hierarchical And-Or model.

The first 3 rows in Fig. 9 show the qualitative results of our model. The red bounding boxes show successful detections, the blue ones missed detections, and the green ones false alarms. In experiments, our model is robust in detecting cars with heavy car-to-car occlusions and background clutter. The failure cases are mainly due to extreme occlusions, extremely low resolution, large car deformation and/or inaccurate (or multiple) bounding box localization.

6.2.2 Results on the Parking Lot Dataset

Evaluation Protocol. We follow the PASCAL VOC evaluation protocol [2], with the overlap of intersection over union being greater than or equal to 60% (instead of the original 50%). In practice, we set this threshold to make a compromise between localization accuracy and detection difficulty. Detected cars with bounding box height smaller than 25 pixels do not count as false positives, as done in [1]. We compare with the latest version of the DPM implementation [18] and set the numbers of contextual patterns and occlusion configurations to 10 and 18, respectively.

Detection Results. The right side of Fig. 8 shows the performance comparison between our model and DPM. Our model obtains 55.2% in AP, which outperforms the latest version of DPM by 10.9%. The fourth and fifth rows in Fig. 9 show the qualitative results. Our model is capable of detecting cars with different occlusions and viewpoints.

6.2.3 Results on the Street Parking Dataset

To compare with the benchmark methods, we follow the evaluation protocol provided in [6].

Results of our model and other benchmark methods are shown in Table 2. Our hierarchical And-Or model outperforms DPM [18] and our previous And-Or Structure [6] by 10.1% and 4.3%, respectively. We attribute the improved performance to the joint representation of context patterns and occlusion configurations. The last two rows in Fig. 9 show some qualitative examples. Our model is capable of detecting occluded street-parking cars; meanwhile, it also produces a few inaccurate detections and misses some cars (mainly due to low resolution).

        DPM [18]   And-Or Structure [6]   AOG+Greedy   AOG+CAD
  AP    52.0%      57.8%                  62.1%        65.3%

TABLE 2. Performance comparison (in AP) on the Street Parking dataset [6].
6.3 Diagnosing the Performance of our Model

In this section, we evaluate various aspects of our model to diagnose the effect of each individual component.

6.3.1 The Effect of Occlusion Modeling

Our And-Or Structure model is based on CAD simulation. Thus, in the first analysis, we test the effectiveness of the learned And-Or structure in representing different occlusion configurations. For this purpose, we generate a synthetic dataset using 5,040 3-car synthetic images as our training data, and a mixture of 3,000 3-car and 7-car (placed in a 1×7 grid) synthetic images as our testing data. For each generated image, we add a background from the category None of the TU Graz-02 dataset [63] and apply Gaussian blur to reduce the boundary effects. Samples of the training and testing data are shown on the left and middle of Fig. 10. In the experimental comparisons, the best DPM has 16 components and the best And-Or structure has 8 views with