Information Pursuit: A Bayesian Framework for Sequential Scene Parsing

Ehsan Jahangiri · Erdem Yörük · René Vidal · Laurent Younes · Donald Geman

arXiv:1701.02343v1 [cs.CV] 9 Jan 2017

Ehsan Jahangiri, E-mail: [email protected]
Erdem Yörük, E-mail: [email protected]
René Vidal, E-mail: [email protected]
Laurent Younes, E-mail: [email protected]
Donald Geman, E-mail: [email protected]
Center for Imaging Science, Johns Hopkins University, Baltimore, MD, USA.

Abstract Despite enormous progress in object detection and classification, the problem of incorporating expected contextual relationships among object instances into modern recognition systems remains a key challenge. In this work we propose information pursuit, a Bayesian framework for scene parsing that combines prior models for the geometry of the scene and the spatial arrangement of object instances with a data model for the output of high-level image classifiers trained to answer specific questions about the scene. In the proposed framework, the scene interpretation is progressively refined as evidence accumulates from the answers to a sequence of questions. At each step, we choose the question to maximize the mutual information between the new answer and the full interpretation given the current evidence obtained from previous inquiries. We also propose a method for learning the parameters of the model from synthesized, annotated scenes obtained by top-down sampling from an easy-to-learn generative scene model. Finally, we introduce a database of annotated indoor scenes of dining room tables, which we use to evaluate the proposed approach.

Keywords Information Pursuit · Object Recognition · Convolutional Neural Networks · Coarse-to-Fine Annotation · Bayesian Inference

1 Introduction

The past few years have seen dramatic improvements in the performance of object recognition systems, especially in 2D object detection and classification. Much of this progress has been driven by the use of deep learning techniques, which allow for end-to-end learning of multiple layers of low-, mid- and high-level image features, which are used to predict, e.g., the object's class, its 2D location, or its 3D pose, provided that sufficiently many annotations for the desired output are available for training the corresponding deep net.

On the other hand, automatic semantic parsing of natural scenes that typically exhibit contextual relationships among multiple object instances remains a core challenge in computational vision. As an example, consider the dining room table scene shown in Figure 1, where it is fairly common for collections of objects to appear in a specific arrangement on the table. For instance, a plate setting often involves a plate with a knife, a fork and a spoon to the left or right of the plate, and a glass in front of the plate. Also, the knife, fork and spoon often appear parallel to each other rather than in a random configuration. These complex spatial relationships among object poses are often not captured by existing deep networks, which tend to detect each object instance independently. We argue that modeling such contextual relationships is essential for highly accurate semantic parsing because detecting objects in the context of other objects can potentially provide more coherent interpretations (e.g., by avoiding object detections that are inconsistent with each other).

Proposed Bayesian Framework: We propose to leverage recent advances in object classification, especially deep learning of low-, mid- and high-level features, to build high-level generative models that reason about objects in the scene rather than features in the image. Specifically, we assume we have at our disposal a battery of classifiers trained to answer specific questions about the scene (e.g., is there a plate in this image patch?) and propose a model for the output of these high-level classifiers.

The proposed model is Bayesian, but can be seen as a hybrid of learning-based and model-based approaches. By the former, we refer to parsing an image by scanning it with a battery of trained classifiers (e.g., SVMs or deep neural nets). By the latter, we refer to identifying likely states under the posterior distribution in a Bayesian framework which combines a prior model over interpretations and a data model based (usually) on low-level image features. In a nutshell, we maintain the battery of classifiers and the Bayesian framework by replacing the low-level features with high-level classifiers. This is enabled by defining the latent variables in one-to-one correspondence with the classifiers. In particular, there are no low-level or mid-level features in the model; all variables, hidden and measured, have semantic content. We refer to the set which indexes the latent variables and corresponding classifiers as “queries” and to the latent variables as “annobits”. For example, some annobits might be lists of binary indicators of the presence or absence of visible instances from a subset of object categories in a specific image patch, and the corresponding classifiers might be CNNs which output a vector of weights for each of these categories. An annobit can be seen as a perfect (noiseless) classifier and, vice versa, a classifier can be seen as an imperfect (noisy) annobit. The data model is the conditional distribution of the family of classifiers given the family of annobits.

The prior model encodes our expectations about how scenes are structured, for example encoding preferred spatial arrangements among objects composing a dining room table setting. Hence the posterior distribution serves to modulate or “contextualize” the raw classifier output. We propose two prior models. The first one combines a prior model of the 3D scene and camera geometry, whose parameters can be encoded by a homography, and a Markov random field (MRF) model of the 2D spatial arrangement of object instances given the homography. The model is motivated by our particular application to parsing dining room table scenes, where most objects lie on the table plane. This model is easy to sample from its posterior, but it is hard to learn tabula rasa due to lack of modularity and therefore the need for a great many training samples. The second model is based on an attributed graph where each node corresponds to an object instance that is attributed with a category label and a pose in the 3D world coordinate system. The attributed graph is built on top of a random skeleton that encodes spatial relationships among different object instances. This model is easy to learn and sample, but sampling from its posterior is much harder. We get the best of both worlds by using the second model to synthesize a large number of annotated scenes, which are then used to learn the parameters of the first model.

Proposed Scene Parsing Strategy: Depending on the scene, running a relatively small subset of all the classifiers might already provide a substantial amount of information about the scene, perhaps even a sufficient amount for a given purpose. Therefore, we propose to annotate the data sequentially, identifying and applying the most informative classifier (in an information-theoretic sense) at each step given the accumulated evidence from those previously applied.

The selection of queries is task-dependent, but some general principles can be articulated. We want to structure them to allow the parsing procedure to move freely among different levels of semantic and geometric resolution, for example to switch from analyzing the scene as a whole, to local scrutiny for fine discrimination, and perhaps back again depending on current input and changes in target probabilities as evidence is acquired. Processing may be terminated at any point, ideally as soon as the posterior distribution is peaked around a coherent scene description, which may occur after only a small fraction of the classifiers have been executed.

The Bayesian framework provides a principled way for deciding what evidence to acquire at each step and for coherently integrating the evidence by updating likelihoods. At each step, we select the classifier (equivalently, the query) which achieves the maximum value of the conditional mutual information between the global scene interpretation and any classifier given the existing evidence (i.e., the output of the classifiers already implemented). Consequently, the order of execution is determined online during scene parsing by solving the corresponding optimization problem at each step. The proposed Information Pursuit (IP) strategy then alternates between selecting the next classifier, applying it to the image data, and updating the posterior distribution on interpretations given the currently collected evidence.

Application to 2D Object Detection and 3D Pose Estimation in the JHU Table-Setting Dataset: We will use the proposed IP strategy to detect instances from multiple object categories in an image and estimate their 3D poses. More precisely, consider a 3D scene and a semantic description consisting of a variable-length list of the identities and 3D poses of visible instances from a pre-determined family of object categories. We want to recover this list by applying high-level classifiers to an observed image of the scene acquired from an unknown viewpoint. As a proof of concept, we will focus on indoor scenes of dining room tables, where the specific categories are plate, glass, utensil and bottle. Such scenes are challenging due to severe occlusion, complex photometry and intra-class variability. In order to train models and classifiers we have collected and manually labeled 3000 images of table settings from the web. We will use this dataset for learning our model, training and testing the classifiers, and evaluating the system's performance. We will show that we can make accurate decisions about existing object instances by processing only a small fraction of patches from a given test image. We will also demonstrate that coarse-to-fine search naturally emerges from IP.

Paper Contributions: In summary, the core contribution of our work is a Bayesian framework for semantic scene parsing that combines (1) a data model on the output of high-level classifiers as opposed to low-level image features, (2) prior models on the scene that capture rich contextual relationships among instances of multiple object categories, (3) a progressive scene annotation strategy driven by stepwise uncertainty reduction, and (4) a dataset of table settings.

Paper Outline: The remainder of the paper is organized as follows. In section 2 we summarize some related work. In section 3 we define the main system variables and formulate information pursuit in mathematical terms. In section 4 we introduce the annobits and the annocell hierarchy. In section 5 we introduce our prior model on 3D scenes, which includes a prior model on interpretation units and a prior model on scene geometry and camera parameters. In section 6 we introduce a novel scene generation model for synthesizing 3D scenes, which is used to learn the parameters of the prior model. The algorithm for sampling from the posterior distribution, a crucial step, is spelled out in section 7 and the particular classifiers (CNNs) and data model (Dirichlet distributions) we use in our experiments are described in section 8. In section 9 we introduce the “JHU Table-Setting Dataset”, which is composed of about 3000 fully annotated scenes, which we use for training the prior model and the classifiers. In section 10 we present comprehensive experiments, including comparisons between IP and using the CNNs alone. Finally, there is a concluding discussion in section 11.

2 Related Work

The IP strategy proposed in this work is partially motivated by the “divide-and-conquer” search strategy employed by humans in playing parlor and board games such as “Twenty Questions,” where the classifiers would represent noisy answers, as well as by the capacity of the human visual system to select potential targets in a scene and ignore other items through acts of selective attention (Serences and Yantis 2006; Reynolds et al. 1999). An online algorithm implementing the IP strategy was first introduced by Geman and Jedynak (1996) under the name “active testing” and designed specifically for road tracking in satellite images. Since then, variations on active testing have appeared in (Sznitman and Jedynak 2010) for face detection and localization, in (Branson et al. 2014) for fine-grained classification, and in (Sznitman et al. 2013) for instrument tracking during retinal microsurgery. However, it has not yet been applied to problems of the complexity of 3D scene interpretation.

CNNs, and more generally deep learning with feature hierarchies, are everywhere. Current CNNs are designed based on the same principles introduced years ago in (Homma et al. 1988; Lecun et al. 1998). In the past decade, more efficient ways to train neural networks with more layers (Hinton et al. 2006; Bengio et al. 2007; Ranzato et al. 2007), together with far larger annotated training sets (e.g., large public image repositories such as ImageNet (Deng et al. 2009)) and efficient implementations on high-performance computing systems, such as GPUs and large-scale distributed clusters (Dean et al. 2012; Ciresan et al. 2011), resulted in the success of deep learning and more specifically CNNs. This has resulted in impressive performance of CNNs on a number of benchmarks and competitions including the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al. 2015). To achieve better performance, the network size has grown constantly in the past few years by taking advantage of newer and more powerful computational resources.

State-of-the-art object detection systems (e.g., RCNN (Girshick et al. 2016) and faster RCNN (Ren et al. 2015)) initially generate some proposal boxes which are likely to contain object instances; these boxes are then processed by the CNN for classification, and then regressed to obtain better bounding boxes for positive detections. In RCNN (Girshick et al. 2016), the proposals are generated using the “selective search” algorithm (Uijlings et al. 2013). The selective search algorithm generates candidates by various ways of grouping the output of an initial image segmentation. The faster region-based CNN (faster RCNN) of Ren et al. (2015) does not use the selective search algorithm to generate the candidate boxes; their network generates the proposals internally in the forward path. These approaches do not use contextual relations to improve disambiguation and prevent inconsistent interpretations, allow for progressive annotation, or accommodate 3D representations. There is no image segmentation in our approach.

There is a considerable amount of work attempting to incorporate contextual reasoning into object recognition. Frequently this is accomplished by labeling pairs of regions obtained from segmentation or image patches using Conditional Random Fields or Markov Random Fields (Rabinovich et al. 2007; Mottaghi et al. 2014; Sun et al. 2014; Desai et al. 2011). Compositional vision (Geman et al. 2002) embeds context in a broader sense by considering more general, non-Markovian models related to context-sensitive grammars. While most of the work is about discriminative learning and reasoning in 2D (Choi et al. 2012; Sun et al. 2014; Desai et al. 2011; Felzenszwalb et al. 2010; Porway et al. 2010; Hoai and Zisserman 2014; Rabinovich et al. 2007), several attempts have been made recently at designing models that reason about surfaces of 3D scenes and the interaction between objects and their supporting surfaces (Bao et al. 2010; Hoiem et al. 2007; Lee et al. 2010; Silberman et al. 2012; Saxena et al. 2009; Liu et al. 2014). It has been shown that reasoning about the underlying 3D layout of the scene is, as expected, useful in recognizing interactions with other objects and surfaces (Bao et al. 2010; Hoiem and Savarese 2011). However, most of the current 3D models do not encode contextual relations among objects on supporting surfaces beyond their coplanarity.

3 General Framework

3.1 Scenes and Queries

Let $\mathcal{Z}$ be a limited set of possible interpretations or descriptions of a physical 3D scene and let $I$ be a 2D image of the scene. In this paper, a description $Z \in \mathcal{Z}$ records the identities and 3D poses of visible instances from a pre-determined family of object categories $\mathcal{C}$. The scene description is unknown, but the image $I$ is observed and is determined by the scene together with other, typically unobserved, variables $W$, including the camera's intrinsic and extrinsic parameters. We will assume that $Z$, $W$ and $I$ are random variables defined on a common probability space.

The goal is to reconstruct as much information as possible about $Z$ from the observation $I$ and to generate a corresponding semantic rendering of the scene by visualizing object instances. In our setting, information about $Z$ is supplied by noisy answers to a series of image-based queries from a specified set $\mathcal{Q}$. We assume the true answer $Y_q$ to a query $q \in \mathcal{Q}$ is determined by $Z$ and $W$; hence, for each $q \in \mathcal{Q}$, $Y_q = f_q(Z, W)$ for some function $f_q$. The dependency of $Y_q$ on $W$ allows the queries to depend on locations relative to the observed image. We regard $Y_q$ as providing a small unit of information about the scene $Z$, and hence as assuming a small set of possible values, even just two, i.e., $Y_q \in \{0, 1\}$ corresponding to the answers “no” or “yes” to a binary query. We will refer to every $Y_q$ as an “annobit” whether or not $q$ is a binary query. Also, for each subset of queries $V \subset \mathcal{Q}$, we will denote the corresponding subset of annobits as $Y_V = (Y_q \mid q \in V)$, and similarly for classifiers $X_V$ (see below).

We will progressively estimate the states of the annobits from a matched family of image-based predictors. More specifically, for each query $q \in \mathcal{Q}$, there is a corresponding classifier $X_q$, where $X_q = h_q(I)$ for some function $h_q$. We will assume that each classifier has the same computational cost; this is necessary for sequential exploration based on information flow alone to be meaningful, but can also be seen as a constraint on the choice of queries $\mathcal{Q}$. We will further assume that $Y_{\mathcal{Q}}$ is a sufficient statistic for $X_{\mathcal{Q}}$ in the sense that

$$P(X_{\mathcal{Q}} \mid Z, W) = P(X_{\mathcal{Q}} \mid Y_{\mathcal{Q}}). \tag{1}$$

We will use a Bayesian model. The prior model is composed of a scene model for $Z$, which encodes knowledge about spatial arrangements of scene objects, and a camera model for $W$. Combining the prior model $P(Z)P(W)$ with the data model $P(X_{\mathcal{Q}} \mid Y_{\mathcal{Q}})$ then allows us to develop inference methods based on (samples from) the posterior $P(Z, W \mid X_{\mathcal{Q}})$. While the specific form of these models naturally depends on the application (see section 5 for a description of these models for our application to table scenes), the information pursuit strategy is generally applicable to any prior and data models, as explained next.

3.2 Information Pursuit

Let $(q_1, \ldots, q_k)$ be an ordered sequence of the first $k$ distinct queries and let $(x_1, \ldots, x_k)$ be possible answers from the corresponding classifiers $(X_{q_1}, \ldots, X_{q_k})$. Consider the event

$$E_k = \{X_{q_1} = x_1, \ldots, X_{q_k} = x_k\}, \tag{2}$$

where $q_\ell$ is the index of the query at step $\ell$ of the process and $x_\ell$ is the observed result of applying classifier $X_{q_\ell}$ to $I$. Therefore, $E_k$ is the accumulated evidence after $k$ queries.

The IP strategy is defined recursively. The first query is fixed by the model:

$$q_1 = \arg\max_{q \in \mathcal{Q}} I(X_q, Y_{\mathcal{Q}}), \tag{3}$$

where $I$ is the mutual information, which is determined by the joint distribution of $X_q$ and $Y_{\mathcal{Q}}$. Thereafter, for $k > 1$,

$$q_k = \arg\max_{q \in \mathcal{Q}} I(X_q, Y_{\mathcal{Q}} \mid E_{k-1}), \tag{4}$$

which is determined by the conditional joint distribution of $X_q$ and $Y_{\mathcal{Q}}$ given the evidence to date, i.e., given $E_{k-1}$. According to (4), a classifier with maximum expected information gain given the currently collected evidence is greedily selected at each step of IP.

From the definition of the mutual information, we have

$$I(X_q, Y_{\mathcal{Q}} \mid E_{k-1}) = H(Y_{\mathcal{Q}} \mid E_{k-1}) - H(Y_{\mathcal{Q}} \mid X_q, E_{k-1}), \tag{5}$$

where $H$ denotes the Shannon entropy. Since the first term on the right-hand side does not depend on $q$, one sees that the next query is chosen such that adding to the evidence the result of applying $X_q$ to the test image will minimize, on average, the uncertainty about $Y_{\mathcal{Q}}$. One point of caution regarding the notation $H(Y_{\mathcal{Q}} \mid X_q, E_{k-1})$: here $Y_{\mathcal{Q}}$ and $X_q$ are random variables, while $E_{k-1}$ is a fixed event. The notation then refers to the conditional entropy of $Y_{\mathcal{Q}}$ given $X_q$ computed under the conditional probability $P(\cdot \mid E_{k-1})$, i.e., the expectation (with respect to the distribution of $X_q$) of the entropy of $Y_{\mathcal{Q}}$ under $P(\cdot \mid X_q = x, E_{k-1})$.

Returning to the interpretation of the selection criterion, we can also write

$$I(X_q, Y_{\mathcal{Q}} \mid E_{k-1}) = H(X_q \mid E_{k-1}) - H(X_q \mid Y_{\mathcal{Q}}, E_{k-1}). \tag{6}$$

This implies that the next question is selected such that:

1. $H(X_q \mid E_{k-1})$ is large, so that its answer is as unpredictable as possible given the current evidence, and
2. $H(X_q \mid Y_{\mathcal{Q}}, E_{k-1})$ is small, so that $X_q$ is predictable given the ground truth (i.e., $X_q$ is a “good” classifier).

The two criteria are however balanced, so that one could accept a relatively poor classifier if it is (currently) highly unpredictable.

Depending on the structure of the joint distribution of $X$ and $Y$, these conditional entropies may not be easy to compute. A possible simplification is to neglect the error rates of $X_q$ at the selection stage, therefore replacing $X_q$ by $Y_q$. Such an approximation leads to a simpler definition of $q_k$, namely

$$q_k = \arg\max_{q \in \mathcal{Q} \setminus \{q_1, \ldots, q_{k-1}\}} H(Y_q \mid E_{k-1}). \tag{7}$$

Notice that, in (7), $X_q$ and $Y_q$ are not assumed to coincide in the conditioning event $E_{k-1}$ (which depends on the $X$ variables), so that the accuracy of the classifiers is still accounted for when evaluating the implications of current evidence. So here again, one prefers asking questions whose (true) answers are unpredictable. For example, one would not ask “Is it an urban scene?” after already having got a positive response to “Is there a skyscraper?”, nor would one ask if there is an object instance from category $c$ in patch “A” if we already know it is highly likely that there is an object instance from category $c$ in patch “B”, a subset of “A”. Removing previous questions from the search is important with this approximation, since the mutual information in (6) vanishes in that case, but not necessarily the conditional entropy in (7).

Returning to the general situation, (6) can be simplified if one makes two independence assumptions:

1. The classifiers are conditionally independent given $Y_{\mathcal{Q}}$;
2. The classifier $X_q$ is conditionally independent of $Y_{\mathcal{Q} \setminus q}$ given $Y_q$, i.e., the distribution of $X_q$ depends on $Y_{\mathcal{Q}}$ only through $Y_q$.

Clearly $H(X_q \mid Y_{\mathcal{Q}}, E_{k-1}) = 0$ if query $q$ belongs to the history, so assume $q \notin \{q_1, \ldots, q_{k-1}\}$. In what follows, let $y = (y_q, q \in \mathcal{Q})$, where $y_q$ represents a possible value of $Y_q$. Then, under assumptions 1 and 2, and using the fact that $E_{k-1}$ only depends on the realizations of $X$, we have:

$$
\begin{aligned}
H(X_q \mid Y_{\mathcal{Q}}, E_{k-1})
&= \sum_{y} H(X_q \mid Y_{\mathcal{Q}} = y, E_{k-1})\, P(Y_{\mathcal{Q}} = y \mid E_{k-1}) \\
&= \sum_{y} H(X_q \mid Y_{\mathcal{Q}} = y)\, P(Y_{\mathcal{Q}} = y \mid E_{k-1}) \\
&= \sum_{y} H(X_q \mid Y_q = y_q)\, P(Y_{\mathcal{Q}} = y \mid E_{k-1}) \\
&= \sum_{y_q} H(X_q \mid Y_q = y_q)\, P(Y_q = y_q \mid E_{k-1}).
\end{aligned}
\tag{8}
$$

The entropy $H(X_q \mid Y_q = y_q)$ can be computed from the data model, and the mixture weights $P(Y_q = y_q \mid E_{k-1})$ can be estimated from Monte Carlo simulations (see section 7). Similarly, the first term in (6), namely $H(X_q \mid E_{k-1})$, can be expressed as the entropy of a mixture:

$$H(X_q \mid E_{k-1}) = -\sum_{x} P(X_q = x \mid E_{k-1}) \log P(X_q = x \mid E_{k-1}) \tag{9}$$

with

$$P(X_q = x \mid E_{k-1}) = \sum_{y} P(X_q = x \mid Y_{\mathcal{Q}} = y, E_{k-1})\, P(Y_{\mathcal{Q}} = y \mid E_{k-1}). \tag{10}$$

Arguing as with the second term in (6), i.e., replacing $P(X_q = x \mid Y_{\mathcal{Q}} = y, E_{k-1})$ by $P(X_q = x \mid Y_q = y_q)$, the last expression reduces, for each fixed $x$, to the mixture

$$\sum_{y_q} P(X_q = x \mid Y_q = y_q)\, P(Y_q = y_q \mid E_{k-1}), \tag{11}$$

so that (9) is the entropy of this mixture distribution. Consequently, given an explicit data model, the information pursuit strategy can be efficiently approximated by sampling from the posterior distribution.

As a final note, we remark that we have used the variables $Y_{\mathcal{Q}}$ to represent the unknown scene $Z$. Writing

$$H(Z \mid E_{k-1}) = H(Z \mid Y_{\mathcal{Q}}, E_{k-1}) + H(Y_{\mathcal{Q}} \mid E_{k-1}), \tag{12}$$

we see that the residual uncertainty on $Z$ given the current evidence will only slightly differ from the residual uncertainty of $Y_{\mathcal{Q}}$ as soon as the residual uncertainty of $Z$ given $Y_{\mathcal{Q}}$ is small, which is a reasonable assumption when the number of annobits is large enough.

We now pass to a more specific description of the variables $X$, $Y$, $Z$ and their distributions. In particular, the next section provides our driving principles for the choice of the annobits. We will then discuss the related classifiers, followed by the construction of the prior and data models, their training and the associated sampling algorithms.

4 Annobits

4.1 General Principles

The choice of the functions $f_q$ that define the annobits, $Y_q = f_q(Z, W)$, $q \in \mathcal{Q}$, naturally depends on the specific application. The annobits we have in mind for scene interpretation, and have used in previous related work on a visual Turing test (Geman et al. 2015), fall mainly into three categories:

– Scene context annobits: These indicate full scene labels, such as “indoor”, “outdoor” or “street”; since our application is focused entirely on “dining room table settings” we do not illustrate these.
– Part-of descriptors: These indicate whether or not one image region is a subset of another, e.g., whether an image patch is part of a table.
– Existence annobits: These relate to the presence or absence of object instances with certain properties or attributes. The most numerous set of annobits in our system asks whether or not instances of a given object category are visible inside a specified region.

Functions of these elementary descriptors can also be of interest. For example, we will rely heavily on annobits providing a list of all object categories visible in a given image region, as described in section 4.3.

4.2 Annocell Hierarchy

Recall from section 3.1 that a scene description $Z$ consists of the object categories and 3D poses of visible instances from a pre-determined family of object categories. Here, motivated by our application to dining room table scenes where objects lie in the table plane, we use a 2D representation of the object pose, which can be put in one-to-one correspondence with its 3D pose via the homography relating the image plane and the table plane (see section 5.2 for details). More specifically, an object instance is a triple $(C, L, D)$, where $C \in \mathcal{C}$ denotes the object category in a set of pre-defined categories $\mathcal{C}$, $L \in \mathcal{L}$ denotes the location of the center of the instance in the image domain $\mathcal{L}$ and $D > 0$ denotes its size in the image (e.g., diameter). The apparent 2D pose space is therefore $\mathcal{L} \times (0, +\infty)$. More refined poses could obviously be considered.

To define the queries, we divide the apparent pose space into cells. Specifically, we consider a finite, distinguished subset of sub-windows, $\mathcal{A}$, and a subset of size intervals, $\mathcal{M}$, and index the queries $q \in \mathcal{Q}$ by the triplet $q = (C, A, M)$, where $C \in \mathcal{C}$, $A \in \mathcal{A}$, and $M \in \mathcal{M}$. For every category $C \in \mathcal{C}$, sub-window $A \in \mathcal{A}$ and size interval $M \in \mathcal{M}$, we let $Y_{C,A,M} = 1$ if an instance of category $C$ with size in $M$ is visible in $A$, and $Y_{C,A,M} = 0$ otherwise. If $M = (0, +\infty)$, we simply write $Y_{C,A}$. We refer to $A \in \mathcal{A}$ as an “annocell.” Specifically, assuming $\mathcal{L} = [0, 1]^2$ (by padding and normalizing), $\mathcal{A}$ consists of square patches of four sizes, $2^{-l}$ for $l \in \{0, 1, 2, 3\}$. The patches at each “level” overlap: for each level, the row and column shift ratio is 25%, i.e., there is 75% overlap between nearest windows. This leads to 1, 25, 169, and 841 patches for levels 0, 1, 2, and 3 respectively, for a total of $|\mathcal{A}| = 1036$ patches. Figure 1 shows some of these regions selected from the four levels of the hierarchy.

Fig. 1 Some selected cells from different levels of the annocell hierarchy. Rectangles with dashed lines are the nearest neighbor patches to the rectangles with solid lines of the same color.

Using a hierarchical annocell structure has the advantage of allowing for coarse-to-fine exploration of the pose space. Note also that, by construction, annocells at low resolution are unions of certain high-resolution ones. This implies that the value of the annobits at low resolution can in turn be derived as maximums of high-resolution annobits.

4.3 Extended Existence Annobits

Due to the nature of the classifiers we use in our application, we also introduce annobits that list the categories that have entirely visible instances in an annocell, i.e., the collection

$$Y^{\mathrm{cat}}_A = (Y_{C,A},\ C \in \mathcal{C}). \tag{13}$$

In addition, we also use category-independent, size-related annobits: for each annocell $A \in \mathcal{A}$ and size interval $M \in \mathcal{M}$, we define a binary annobit $Y^{\mathrm{sc}}_{A,M}$ which indicates whether or not the average size of the objects present in $A$ belongs to $M$.

4.4 Classifiers for Annobits

The particular image-based predictors of the annobits we use in the table-setting application are described in full detail in section 8. Some examples include:

– Variables $X^{\mathrm{cat}}_A$, $A \in \mathcal{A}$, which provide a vector of weights on $\mathcal{C}$ for predicting $Y^{\mathrm{cat}}_A$.
– Variables $X^{\mathrm{sc}}_A$, $A \in \mathcal{A}$, which provide a probability vector on $\mathcal{M}$ for predicting $(Y^{\mathrm{sc}}_{A,M},\ M \in \mathcal{M})$.

Additional variables $X^{t}_A$, $A \in \mathcal{A}'$ (where $\mathcal{A}'$ is a subset of $\mathcal{A}$) will also be introduced. They are designed to predict the information units $Y^{t}_A$, where $Y^{t}_A = 1$ if more than half of $A$ overlaps the table. Observe that the classifier $X_q$ assigned to $Y_q$ does not necessarily take values in the same set as $Y_q$. However, this is not a problem since we are only interested in the conditional distribution of $X$ given $Y$.

5 Prior Model

Following section 3, the joint distribution of the annobits $(Y_q, q \in \mathcal{Q})$ is derived from a prior model on the 3D scene description, $Z$, and on the camera parameters $W$. We assume these variables to be independent and model them separately.

5.1 Scene Model P(Z|S)

Motivated by our application to dining room table scenes, we assume a fixed dominant plane in the 3D model, and choose a coordinate system $Oxyz$ in $\mathbb{R}^3$, such that the $xy$-plane coincides with this dominant plane. The scene $Z$ is represented as a set of object instances, assumed to be sitting on a bounded region of the dominant plane, in our case a centered, rectangular table $S$ characterized by its length and width. Recall from section 4.2 that each object instance $i$ is represented by a category $C_i \in \mathcal{C}$, a location $L_i$ and a size $D_i$ in the image. Here, we assume that objects from a given category have a fixed size, so that $Z = \{Z_i\}$ with $Z_i = (C_i, L_i)$. The distribution of $Z$ will be defined conditional on $S$, since, for example, the size of $S$ will directly impact the number of objects that it can support. More generally the table can be replaced by some other variable $S$ representing more complex properties of the global scene geometry. For convenience we sometimes drop $S$ from our notation. However, most of the model components introduced below depend on $S$, and the proposed model is to be understood conditional on $S$.

We partition the reference plane into small cells (5 cm × 5 cm in the table-setting case) and use binary variables to indicate the presence of instances of object categories centered in each cell. In other words, we discretize the family $(C_i, L_i)$ into a binary random field that we will still denote by $Z$. Letting $\mathcal{J}$ denote the set of cells, a configuration can therefore be represented as the binary vector $z = (z_{j,c},\ j \in \mathcal{J},\ c \in \mathcal{C})$ where $z_{j,c} = 1$ if and only if an object of category $c$ is centered in the cell $j$.

The configuration $z$ is obviously a discrete representation of the scene layout restricted to object categories $\mathcal{C}$ and locations $\mathcal{L}$. Letting $\Omega$ denote the space of all such configurations, we will use a Gibbs distribution on $\Omega$ associated with a family of feature functions $\varphi = (\varphi_i,\ i = 1, \ldots, n)$, with $\varphi_i : \Omega \to \{0, 1\}$, and scalar parameters $\lambda = (\lambda_i,\ i = 1, \ldots, n)$. The Gibbs distribution then has the following form:

$$p(z) = \frac{1}{\kappa(\lambda)} \exp\big(\lambda \cdot \varphi(z)\big), \tag{14}$$

where $\kappa(\lambda)$ is the normalizing factor (partition function) ensuring that the probabilities sum up to one. Figure 2 shows a table and its fitted mesh where each of the cells is a 5 cm × 5 cm square.

Fig. 2 Table fitting mesh.

We use the following features:

– Existence features, which indicate whether or not an instance from a given category is centered anywhere in a given set of cells, therefore taking the form

$$\varphi_{J,c}(z) = \max(z_{j,c},\ j \in J) \tag{15}$$

with $J \subset \mathcal{J}$. We consider sets $J$ at three different granularity levels, illustrated in Figure 3. At the fine level $J = \{j\}$ is a singleton, so that $\varphi_{J,c}(z) = z_{j,c}$. We also consider middle-level sets (3 × 3 arrays of fine cells) and coarse-level sets (6 × 6 arrays of fine cells) that cover the reference plane without intersection.
– Conjunction features, which are products of two middle-level existence features (of the same or different categories), and therefore signal their co-occurrence:

$$\varphi_{J_1,c_1,J_2,c_2}(z) = \varphi_{J_1,c_1}(z)\, \varphi_{J_2,c_2}(z). \tag{16}$$

To limit model complexity, only pairs $J_1, J_2$ whose centers are less than a threshold apart are considered, where the threshold can depend on the pair $c_1, c_2$.

Fig. 3 Domain of various types of feature functions.

Invariance and symmetry assumptions about the 3D scene are then encoded as equality constraints among the model parameters, thereby reducing model complexity. Grouping binary features $\varphi_i$ with identical parameters $\lambda_i$ is then equivalent to considering a new set of features that count the number of layout configurations satisfying some conditions on the locations and categories. For table settings, it is natural to assume invariance by rotation around the center of the table. Hence we assume that existence features whose domain $J$ is of the same size and located at the same distance from the closest table edge all have the same weights ($\lambda$'s), and hence the probability only depends on the number of such instances.

We group conjunction feature functions based on the distance of the first patch to the edge of the table, and the relative position of the second patch (left, right, front, or back) with respect to the first patch.

Remark 1: The model can be generalized to include pose attributes other than location, e.g., orientation, size and height. If $\Theta$ denotes the space of poses, then one can extend the state space for $z_{j,c}$ to $\{0, 1\} \times \Theta$, interpreting $z_{j,c} = (1, \theta)$ as the presence of an object with category $c$ and pose $\theta$ in cell $j$, and $z_{j,c} = (0, \theta)$ as the absence of any object with category $c$, $\theta$ being irrelevant. Features can then be extended to this state space to provide a joint distribution that includes pose. The simplest approach would be to only extend univariate features, so that object poses and other attributes are conditionally independent given their categories and locations (and the geometry variable $S$, since the model is always assumed conditional on it). Other attributes (color, style, etc.) can be incorporated in a similar way.

5.2 Camera Model P(W)

The second component of the prior model determines the probability distribution of the extrinsic and intrinsic camera parameters, such as its pose and focal length, respectively. The definition of these parameters is fairly standard in computer vision (see e.g., Ma et al. (2003)), but the definition of generative models for these parameters is not. In what follows we summarize the typical definitions, and leave the details of the generative model to the Appendix.

Remember that we assumed a fixed coordinate system in 3D in which the $xy$-plane coincides with the dominant “horizontal” plane. Consider also a second, camera coordinate system $O'x'y'z'$, such that the $x'y'$-plane coincides with the image plane. The extrinsic camera parameters are defined by the pose $(R, T)$ of the camera coordinate system $O'x'y'z'$ relative to the fixed coordinate system $Oxyz$, where $R$ is the camera rotation, which maps the unit axis vectors of $Oxyz$ to the unit axis vectors of $O'x'y'z'$, and $T = OO'$ is the translation vector. We parametrize the rotation $R$ by three angles $\psi = (\psi_x, \psi_y, \psi_z)$ representing, respectively, counter-clockwise rotations of the camera's coordinate system about the $x$-axis, $y$-axis, and $z$-axis of the world coordinate system (see equation (29) for the conversion of unit vectors to angles). Observe that one can express the coordinates $m = (x, y, z)^\top$ of a 3D point in the world coordinate system as functions of its coordinates $m' = (x', y', z')^\top$ in the camera coordinate system in the form $m = Rm' + T$. Since in our case 3D points lie in a plane $N^\top m' = d$, where $N$ is the normal to the plane (i.e., the table) measured in the camera coordinate system and $d$ is the distance from the plane to the camera center, we further have $m = Hm'$, where $H = R + TN^\top/d$ is the homography between the camera plane and the world plane.

The intrinsic camera parameters are defined by the coordinates of the focal point, $(x_0, y_0, -f)$, where $f > 0$ is the focal length and $(x_0, y_0)$ is the intersection of the principal axis of the camera with the image plane, as well as the pixel sizes in directions $x'$ and $y'$, denoted by $\gamma_x$ and $\gamma_y$.

The complete set of camera parameters is therefore 11-dimensional and given by $W = (f, \gamma_x, \gamma_y, x_0, y_0, \psi, T)$. Our generative model for $W$ assumes that:

– Intrinsic camera parameters are independent from extrinsic camera parameters.
– Pixels are square, i.e., $\gamma_x = \gamma_y$, but intrinsic parameters are otherwise independent. The focal length $f$ is uniformly distributed between 10 and 40 millimeters, $x_0$ (resp. $y_0$) is uniformly distributed between $W_p/4$ and $3W_p/4$ (resp. $H_p/4$ and $3H_p/4$), where $W_p$ and $H_p$ are the width and height of the image in pixels, and $\gamma_x = \gamma_y$ is uniformly distributed between $1/W_p$ and $1.2/W_p$.
– The vertical component of $T$ is independent of the other two and the distribution of the horizontal components is rotation invariant. Specifically, letting $T = (T_x, T_y, T_z)$, we assume that $(T_z - 0.3)/2.7$ follows a Beta distribution so that $T_z \in [0.3, 3]$ (expressed in meters). Then, letting $r = \sqrt{T_x^2 + T_y^2}$ denote the distance between the horizontal projection of $T$ on the table plane and the center of the table, we assume that $r/4$ follows a Beta distribution. We assume independence of $r$ and $T_z$ and invariance by rotation around the vertical axis, which specifies the distribution of $T$.
– The distribution of the rotation angles $\psi$ is defined conditionally on $T$. Specifically, we assume that the camera roughly points towards the center of the scene and that the horizontal direction in the image plane is also horizontal in the 3D coordinate system. Additional details of the model for $p(\psi \mid T)$ are provided in the Appendix.

[...] allow us to train the scene model independently from the unknown transformation that maps it to the image. This can be done in several ways. For example, given four points in the image that are the projections of the corners of a square in the reference plane, one can reconstruct, up to a scale factor, the homography mapping this plane to the image. Doing this with reasonable accuracy is relatively easy in general for a human annotator, and allows one to invert the outline of every flat object in the image that lies on the reference plane to its 3D shape, up to a scale ambiguity. This ambiguity can be removed by knowing the true distance between two points in the reference plane, and their positions in the image. We used this level of annotation and representation for our table settings, based on the fact that all objects of interest were either horizontal (e.g., plates), or had easily identifiable horizontal components (e.g., bottoms of bottles), and we assumed that plates had a standard diameter of 25 cm to remove the scale ambiguity.

As can be seen, the level of annotation required to train our prior model is quite high.
While we have been able to produce rich annotations for 3,000 images of dining room 5.3 SceneGeometryModelP(S)andGlobalModel tablesettings(seesection9),thisisinsufficienttotrainour model.Toaddressthisissue,inthenextsectionwepropose WeassumethatthescenegeometryS takesvalueinafinite a 3D scene generation model that can be use to generate a setof“templategeometries”thatcoarselycoverallpossible large number of annotations for as many synthetic images situations.Notethatthesetemplatesaredefineduptotrans- as needed. Given the annotations of both synthetic images lation, since we can always assume that the 3D reference (section 6) as well as real images (section 9), the param- frameisplacedinagivenpositionrelativetothegeometry. eters of our prior model are learned using an accelerated For table settings, where the geometry represents the table versionoftherobuststochasticapproximation(Nemirovski itself,ourtemplatesweresimplysquaretableswithsizedis- etal.2009)tomatchempiricalstatisticscalculatedbasedon tributed according to a shifted and scaled Beta distribution top-downsamplesfromthescenegenerationmodel(seeJa- rangingfrom0.5to3meters.Thisroughapproximationwas hangiri(2016)fordetails). sufficientforourpurposes,eventhoughtablesinrealscenes areobviouslymuchmorevariableinshapeandsize. Finally,thejointpriordistributionp(z,s,w) = P(Z = z,S =s,W =w)ofallthevariablesisdefinedby: p(z,s,w)=p(z|s)p(s)p(w). (17) 6 SceneGenerationModel 5.4 LearningthePriorModel Inthissectionweproposea3Dscenegenerationmodelthat ThemodelsforP(S)andP(W)aresimpleenoughthatwe can be used to generate a large number of annotations to specifiedtheirmodelparametersmanually,asdescribedbe- train the prior model described in the section 5. The pro- fore. Therefore, the fundamental challenge is to learn the posedmodelmimicsanaturalsequenceofstepsincompos- priormodelonsceneinterpretationsP(Z|S).Forthispur- ing a scene. 
First, create spontaneous instances by placing pose, we assume that a training set of annotated images is some objects randomly in the scene; the distribution of lo- available. The annotation for each image consists of a list cations depends on the scene geometry. Then, allow each of object instances, each one labeled by its category (and of these instances to trigger the placement of ancillary ob- possiblyotherattributes)andapparent2Dposerepresented jects,whosecategoriesandattributesaresampledcondition- by an ellipse in the image plane. We also assume that suf- ally,creatinggroupsofcontextuallyrelatedobjects.Thisre- ficient information is provided to propagate the image an- cursiveprocessterminateswhennochildrenarecreated,or notation to a scene annotation in 3D coordinates; this will whenthenumberofiterationsreachesanupper-bound. 10 EhsanJahangirietal. plate bottle 5 17 15 4 3 16 14 13 9 utensil glass 20 6 10 7 2 18 Fig.4 Anexamplemastergraph. 1 12 11 8 6.1 ModelDescriptionUsingaGenerativeAttributed 19 Graph plate glass utensil Toformallydefinethisprocess,wewillusethenotationn= (n ,c ∈ C)torepresentafamilyofintegercountsn ∈ N indcexed by categories, so that n ∈ N|C|. We will aclso let 1 2 3 4 5 6 (cid:80) |n|= c∈Cnc. 7 8 9 10 11 12 13 14 15 16 17 We will assume a probability distribution p(0) on N|C|, andafamilyofsuchdistributionsp(c),c ∈ C.Thesedistri- 18 19 20 butions(whicharedefinedconditionallytoS =s)areused Fig.5 Atable-settingscene(top)anditscorrespondingskeletongraph to decide the number of objects that will be placed in the (bottom) where the categories (plate, bottle, glass, and utensil) are sceneateachstep.Morespecifically: color-codedinthegraph.RootnodesV0initializethegenerativepro- cess; here there are six. The terminal nodes for this instance are 1. p(0)(·|s)istheconditionaljointdistributionofthenum- VT = {6,8,9,10,11,14,15,16,17,18,19,20}.Accordingtothe basegraphn(0) =4,n(0) =0,n(0) =0andn(0) =2. 
berofobjectinstancesfromeachcategorythatareplaced plate bottle glass utensil initiallyonthescene. 2. Foreachcategoryc ∈ C,p(c)(·|s)isthejointdistribu- gory attribute) to obtain a complete scene description. The tionofthenumbersofnewobjectinstancesthataretrig- probabilitydistributionofG is 0 geredbytheadditionofanobjectinstancefromcategory (cid:89) c.Thesedistributionscanbethoughtofasthebasisdis- p(G |s)= p(c(v))(n(v)|s), (18) 0 tributions in a multi-type branching process (see Mode v∈V\VT (1971)). where V isthe setofterminal nodesand n(v) arethe cat- T Thecomplexityoftheprocessiscontrolledbyamastergraph egory counts of the children of v (graphs being identified that restricts the subset of categories that can be created at uptocategory-invariantisomorphisms).Anexampleofsuch eachstep.Moreformally,thisdirectedgraphhasverticesin graphisprovidedinFigure5. {0}∪C andissuchthatp(v)issupportedbycategoriesthat To complete the description, we need to associate at- arechildrenofthenodev ∈{0}∪C.Adjoining0tothenode tributes to objects, the most important of them being their labelsavoidstreatingp(0)asaspecialcaseinthederivations posesinthe3Dworld,onwhichwefocusnow.IntheMRF below. The master graph we used on table settings is pro- designedforourexperiments,theonlyrelevantinformation vided in Figure 4, where we regard “plate” and “bottle” as aboutposewasthelocationonthetable,a2Dparameter.It thechildrenofcategory0.Notethatsinceweallowsponta- ishoweverpossibletodesignatop-downgenerativemodel neousinstancesfromallcategorieseverycategoryisachild that includes richer information, using for example a 3D tocategory0. ellipsoid. 
Such representations involve a small number of Theoutputofthisbranchingprocesscanberepresented parameters denoted generically by θ: each vertex v in the as a directed tree G0 = (V,C,E) in which each vertex skeleton graph is attributed by parameters such as its pose v ∈ V isattributedacategorydenotedbyC(v)andE isa denotedbyθ(v).Whenusingellipsoids,θ(v) involveseight setofedges.Therootnodeofthetree,hereafterdenotedby free parameters (five for the shape of the ellipsoid, which 0,essentiallyrepresentstheemptyscenewhose“category” isapositivedefinitesymmetricmatrix,andthreeforitscen- isalsodenotedby0(notethat0 (cid:54)∈ C).Allothernodeshave ter).Fewerparameterswouldbeneededforflatobjects(rep- categories in C. Each non-terminal node v ∈ V has |N(v)| resented by a 2D ellipse), or vertical ones, or objects with children where N(v) ∼ p(c(v))(·|s) so that N(v) of these rotational symmetry. In any case, it is obvious that the dis- c childrenhavecategoryc.WewillrefertoG asaskeleton tributionofanobjectposedependsheavilyonitscategory. 0 tree,whichneedstobecompletedwiththeobjectattributes Inourmodel,contextualinformationisimportant:when (excluding its category since G already includes the cate- placinganobjectrelativetoaparent,theposealsodepends 0