arXiv:1701.02343v1 [cs.CV] 9 Jan 2017

Information Pursuit: A Bayesian Framework for Sequential Scene Parsing

Ehsan Jahangiri · Erdem Yörük · René Vidal · Laurent Younes · Donald Geman

Received: date / Accepted: date

Abstract Despite enormous progress in object detection and classification, the problem of incorporating expected contextual relationships among object instances into modern recognition systems remains a key challenge. In this work we propose information pursuit, a Bayesian framework for scene parsing that combines prior models for the geometry of the scene and the spatial arrangement of object instances with a data model for the output of high-level image classifiers trained to answer specific questions about the scene. In the proposed framework, the scene interpretation is progressively refined as evidence accumulates from the answers to a sequence of questions. At each step, we choose the question to maximize the mutual information between the new answer and the full interpretation given the current evidence obtained from previous inquiries. We also propose a method for learning the parameters of the model from synthesized, annotated scenes obtained by top-down sampling from an easy-to-learn generative scene model. Finally, we introduce a database of annotated indoor scenes of dining room tables, which we use to evaluate the proposed approach.

Keywords Information Pursuit · Object Recognition · Convolutional Neural Networks · Coarse-to-Fine Annotation · Bayesian Inference

Ehsan Jahangiri, E-mail: [email protected]
Erdem Yörük, E-mail: [email protected]
René Vidal, E-mail: [email protected]
Laurent Younes, E-mail: [email protected]
Donald Geman, E-mail: [email protected]
Center for Imaging Science, Johns Hopkins University, Baltimore, MD, USA.

1 Introduction

The past few years have seen dramatic improvements in the performance of object recognition systems, especially in 2D object detection and classification. Much of this progress has been driven by the use of deep learning techniques, which allow for end-to-end learning of multiple layers of low-, mid- and high-level image features, which are used to predict, e.g., the object's class, its 2D location, or its 3D pose, provided that sufficiently many annotations for the desired output are provided for training the corresponding deep net.

On the other hand, automatic semantic parsing of natural scenes that typically exhibit contextual relationships among multiple object instances remains a core challenge in computational vision. As an example, consider the dining room table scene shown in Figure 1, where it is fairly common for collections of objects to appear in a specific arrangement on the table. For instance, a plate setting often involves a plate with a knife, a fork and a spoon to the left or right of the plate, and a glass in front of the plate. Also, the knife, fork and spoon often appear parallel to each other rather than in a random configuration. These complex spatial relationships among object poses are often not captured by existing deep networks, which tend to detect each object instance independently. We argue that modeling such contextual relationships is essential for highly accurate semantic parsing, because detecting objects in the context of other objects can potentially provide more coherent interpretations (e.g., by avoiding object detections that are inconsistent with each other).

Proposed Bayesian Framework: We propose to leverage recent advances in object classification, especially deep learning of low-, mid- and high-level features, to build high-level generative models that reason about objects in the scene rather than features in the image. Specifically, we assume we have at our disposal a battery of classifiers trained to answer specific questions about the scene (e.g., is there a plate in this image patch?) and propose a model for the output of these high-level classifiers.

The proposed model is Bayesian, but can be seen as a hybrid of learning-based and model-based approaches. By the former, we refer to parsing an image by scanning it with a battery of trained classifiers (e.g., SVMs or deep neural nets). By the latter, we refer to identifying likely states under the posterior distribution in a Bayesian framework which combines a prior model over interpretations and a data model based (usually) on low-level image features. In a nutshell, we maintain the battery of classifiers and the Bayesian framework, but replace the low-level features with high-level classifiers. This is enabled by defining the latent variables in one-to-one correspondence with the classifiers. In particular, there are no low-level or mid-level features in the model; all variables, hidden and measured, have semantic content. We refer to the set which indexes the latent variables and corresponding classifiers as "queries" and to the latent variables as "annobits". For example, some annobits might be lists of binary indicators of the presence or absence of visible instances from a subset of object categories in a specific image patch, and the corresponding classifiers might be CNNs which output a vector of weights for each of these categories. An annobit can be seen as a perfect (noiseless) classifier and, vice versa, the classifier can be seen as an imperfect (noisy) annobit. The data model is the conditional distribution of the family of classifiers given the family of annobits.

The prior model encodes our expectations about how scenes are structured, for example encoding preferred spatial arrangements among objects composing a dining room table setting. Hence the posterior distribution serves to modulate or "contextualize" the raw classifier output. We propose two prior models. The first one combines a prior model of the 3D scene and camera geometry, whose parameters can be encoded by a homography, and a Markov random field (MRF) model of the 2D spatial arrangement of object instances given the homography. The model is motivated by our particular application to parsing dining room table scenes, where most objects lie on the table plane. This model is easy to sample from its posterior, but it is hard to learn tabula rasa due to lack of modularity and therefore the need for a great many training samples. The second model is based on an attributed graph where each node corresponds to an object instance that is attributed with a category label and a pose in the 3D world coordinate system. The attributed graph is built on top of a random skeleton that encodes spatial relationships among different object instances. This model is easy to learn and sample, but sampling from its posterior is much harder. We get the best of both worlds by using the second model to synthesize a large number of annotated scenes, which are then used to learn the parameters of the first model.

Proposed Scene Parsing Strategy: Depending on the scene, running a relatively small subset of all the classifiers might already provide a substantial amount of information about the scene, perhaps even a sufficient amount for a given purpose. Therefore, we propose to annotate the data sequentially, identifying and applying the most informative classifier (in an information-theoretic sense) at each step given the accumulated evidence from those previously applied.

The selection of queries is task-dependent, but some general principles can be articulated. We want to structure them to allow the parsing procedure to move freely among different levels of semantic and geometric resolution, for example to switch from analyzing the scene as a whole to local scrutiny for fine discrimination, and perhaps back again, depending on current input and changes in target probabilities as evidence is acquired. Processing may be terminated at any point, ideally as soon as the posterior distribution is peaked around a coherent scene description, which may occur after only a small fraction of the classifiers have been executed.

The Bayesian framework provides a principled way for deciding what evidence to acquire at each step and for coherently integrating the evidence by updating likelihoods. At each step, we select the classifier (equivalently, the query) which achieves the maximum value of the conditional mutual information between the global scene interpretation and any classifier given the existing evidence (i.e., the output of the classifiers already applied). Consequently, the order of execution is determined online during scene parsing by solving the corresponding optimization problem at each step. The proposed Information Pursuit (IP) strategy then alternates between selecting the next classifier, applying it to the image data, and updating the posterior distribution on interpretations given the currently collected evidence.

Application to 2D Object Detection and 3D Pose Estimation in the JHU Table-Setting Dataset: We will use the proposed IP strategy to detect instances from multiple object categories in an image and estimate their 3D poses. More precisely, consider a 3D scene and a semantic description consisting of a variable-length list of the identities and 3D poses of visible instances from a pre-determined family of object categories. We want to recover this list by applying high-level classifiers to an observed image of the scene acquired from an unknown viewpoint. As a proof of concept, we will focus on indoor scenes of dining room tables, where the specific categories are plate, glass, utensil and bottle. Such scenes are challenging due to severe occlusion, complex photometry and intra-class variability. In order to train models and classifiers we have collected and manually labeled 3000 images of table settings from the web. We will use this dataset for learning our model, training and testing the classifiers, and evaluating the system's performance. We will show that we can make accurate decisions about existing object instances by processing only a small fraction of patches from a given test image. We will also demonstrate that coarse-to-fine search naturally emerges from IP.

Paper Contributions: In summary, the core contribution of our work is a Bayesian framework for semantic scene parsing that combines (1) a data model on the output of high-level classifiers as opposed to low-level image features, (2) prior models on the scene that capture rich contextual relationships among instances of multiple object categories, (3) a progressive scene annotation strategy driven by stepwise uncertainty reduction, and (4) a dataset of table settings.

Paper Outline: The remainder of the paper is organized as follows. In section 2 we summarize some related work. In section 3 we define the main system variables and formulate information pursuit in mathematical terms. In section 4 we introduce the annobits and the annocell hierarchy. In section 5 we introduce our prior model on 3D scenes, which includes a prior model on interpretation units and a prior model on scene geometry and camera parameters. In section 6 we introduce a novel scene generation model for synthesizing 3D scenes, which is used to learn the parameters of the prior model. The algorithm for sampling from the posterior distribution, a crucial step, is spelled out in section 7, and the particular classifiers (CNNs) and data model (Dirichlet distributions) we use in our experiments are described in section 8. In section 9 we introduce the "JHU Table-Setting Dataset", which is composed of about 3000 fully annotated scenes and which we use for training the prior model and the classifiers. In section 10 we present comprehensive experiments, including comparisons between IP and using the CNNs alone. Finally, there is a concluding discussion in section 11.

2 Related Work

The IP strategy proposed in this work is partially motivated by the "divide-and-conquer" search strategy employed by humans in playing parlor and board games such as "Twenty Questions," where the classifiers would represent noisy answers, as well as by the capacity of the human visual system to select potential targets in a scene and ignore other items through acts of selective attention (Serences and Yantis 2006; Reynolds et al. 1999). An online algorithm implementing the IP strategy was first introduced by Geman and Jedynak (1996) under the name "active testing" and designed specifically for road tracking in satellite images. Since then, variations on active testing have appeared in (Sznitman and Jedynak 2010) for face detection and localization, in (Branson et al. 2014) for fine-grained classification, and in (Sznitman et al. 2013) for instrument tracking during retinal microsurgery. However, it has not yet been applied to problems of the complexity of 3D scene interpretation.

CNNs, and more generally deep learning with feature hierarchies, are everywhere. Current CNNs are designed based on the same principles introduced years ago in (Homma et al. 1988; LeCun et al. 1998). In the past decade, more efficient ways to train neural networks with more layers (Hinton et al. 2006; Bengio et al. 2007; Ranzato et al. 2007), together with far larger annotated training sets (e.g., large public image repositories such as ImageNet (Deng et al. 2009)) and efficient implementations on high-performance computing systems such as GPUs and large-scale distributed clusters (Dean et al. 2012; Ciresan et al. 2011), resulted in the success of deep learning and more specifically of CNNs. This has resulted in impressive performance of CNNs on a number of benchmarks and competitions, including the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al. 2015). To achieve better performance, the network size has grown constantly in the past few years by taking advantage of the newer and more powerful computational resources.

State-of-the-art object detection systems (e.g., RCNN (Girshick et al. 2016) and faster RCNN (Ren et al. 2015)) initially generate proposal boxes which are likely to contain object instances; these boxes are then processed by the CNN for classification, and then regressed to obtain better bounding boxes for positive detections. In RCNN (Girshick et al. 2016), the proposals are generated using the "selective search" algorithm (Uijlings et al. 2013), which generates candidates by various ways of grouping the output of an initial image segmentation. The faster region-based CNN (faster RCNN) of Ren et al. (2015) does not use the selective search algorithm to generate the candidate boxes; their network generates the proposals internally in the forward path. These approaches do not use contextual relations to improve disambiguation and prevent inconsistent interpretations, allow for progressive annotation, or accommodate 3D representations. There is no image segmentation in our approach.

There is a considerable amount of work attempting to incorporate contextual reasoning into object recognition. Frequently this is accomplished by labeling pairs of regions obtained from segmentation or image patches using Conditional Random Fields or Markov Random Fields (Rabinovich et al. 2007; Mottaghi et al. 2014; Sun et al. 2014; Desai et al. 2011). Compositional vision (Geman et al. 2002) embeds context in a broader sense by considering more general, non-Markovian models related to context-sensitive grammars. While most of the work is about discriminative learning and reasoning in 2D (Choi et al. 2012; Sun et al. 2014; Desai et al. 2011; Felzenszwalb et al. 2010; Porway et al. 2010; Hoai and Zisserman 2014; Rabinovich et al. 2007), several attempts have been made recently at designing models that reason about surfaces of 3D scenes and the interaction between objects and their supporting surfaces (Bao et al. 2010; Hoiem et al. 2007; Lee et al. 2010; Silberman et al. 2012; Saxena et al. 2009; Liu et al. 2014). It has been shown that reasoning about the underlying 3D layout of the scene is, as expected, useful in recognizing interactions with other objects and surfaces (Bao et al. 2010; Hoiem and Savarese 2011). However, most of the current 3D models do not encode contextual relations among objects on supporting surfaces beyond their coplanarity.

3 General Framework

3.1 Scenes and Queries

Let Z be a limited set of possible interpretations or descriptions of a physical 3D scene and let I be a 2D image of the scene. In this paper, a description Z ∈ Z records the identities and 3D poses of visible instances from a pre-determined family of object categories C. The scene description is unknown, but the image I is observed and is determined by the scene together with other, typically unobserved, variables W, including the camera's intrinsic and extrinsic parameters. We will assume that Z, W and I are random variables defined on a common probability space.

The goal is to reconstruct as much information as possible about Z from the observation I and to generate a corresponding semantic rendering of the scene by visualizing object instances. In our setting, information about Z is supplied by noisy answers to a series of image-based queries from a specified set Q. We assume the true answer Y_q to a query q ∈ Q is determined by Z and W; hence, for each q ∈ Q, Y_q = f_q(Z, W) for some function f_q. The dependency of Y_q on W allows the queries to depend on locations relative to the observed image. We regard Y_q as providing a small unit of information about the scene Z, and hence as assuming a small set of possible values, even just two, i.e., Y_q ∈ {0,1} corresponding to the answers "no" or "yes" to a binary query. We will refer to every Y_q as an "annobit" whether or not q is a binary query. Also, for each subset of queries V ⊂ Q, we will denote the corresponding subset of annobits as Y_V = (Y_q | q ∈ V) and similarly for classifiers X_V (see below).

We will progressively estimate the states of the annobits from a matched family of image-based predictors. More specifically, for each query q ∈ Q, there is a corresponding classifier X_q, where X_q = h_q(I) for some function h_q. We will assume that each classifier has the same computational cost; this is necessary for sequential exploration based on information flow alone to be meaningful, but can also be seen as a constraint on the choice of queries Q. We will further assume that Y_Q is a sufficient statistic for X_Q in the sense that

P(X_Q | Z, W) = P(X_Q | Y_Q).   (1)

We will use a Bayesian model. The prior model is composed of a scene model for Z, which encodes knowledge about spatial arrangements of scene objects, and a camera model for W. Combining the prior model P(Z)P(W) with the data model P(X_Q | Y_Q) then allows us to develop inference methods based on (samples from) the posterior P(Z, W | X_Q). While the specific form of these models naturally depends on the application (see section 5 for a description of these models for our application to table scenes), the information pursuit strategy is generally applicable to any prior and data models, as explained next.

3.2 Information Pursuit

Let (q_1, ..., q_k) be an ordered sequence of the first k distinct queries and let (x_1, ..., x_k) be possible answers from the corresponding classifiers (X_{q_1}, ..., X_{q_k}). Consider the event

E_k = {X_{q_1} = x_1, ..., X_{q_k} = x_k},   (2)

where q_ℓ is the index of the query at step ℓ of the process and x_ℓ is the observed result of applying classifier X_{q_ℓ} to I. Therefore, E_k is the accumulated evidence after k queries.

The IP strategy is defined recursively. The first query is fixed by the model:

q_1 = argmax_{q ∈ Q} I(X_q, Y_Q),   (3)

where I is the mutual information, which is determined by the joint distribution of X_q and Y_Q. Thereafter, for k > 1,

q_k = argmax_{q ∈ Q} I(X_q, Y_Q | E_{k-1}),   (4)

which is determined by the conditional joint distribution of X_q and Y_Q given the evidence to date, i.e., given E_{k-1}. According to (4), a classifier with maximum expected information gain given the currently collected evidence is greedily selected at each step of IP.

From the definition of the mutual information, we have

I(X_q, Y_Q | E_{k-1}) = H(Y_Q | E_{k-1}) − H(Y_Q | X_q, E_{k-1}),   (5)

where H denotes the Shannon entropy. Since the first term on the right-hand side does not depend on q, one sees that the next query is chosen such that adding to the evidence the result of applying X_q to the test image will minimize, on average, the uncertainty about Y_Q. One point of caution regarding the notation H(Y_Q | X_q, E_{k-1}): here Y_Q and X_q are random variables, while E_{k-1} is a fixed event. The notation then refers to the conditional entropy of Y_Q given X_q computed under the conditional probability P(· | E_{k-1}), i.e., the expectation (with respect to the distribution of X_q) of the entropy of Y_Q under P(· | X_q = x, E_{k-1}).

Returning to the interpretation of the selection criterion, we can also write

I(X_q, Y_Q | E_{k-1}) = H(X_q | E_{k-1}) − H(X_q | Y_Q, E_{k-1}).   (6)

This implies that the next question is selected such that:

1. H(X_q | E_{k-1}) is large, so that its answer is as unpredictable as possible given the current evidence, and
2. H(X_q | Y_Q, E_{k-1}) is small, so that X_q is predictable given the ground truth (i.e., X_q is a "good" classifier).

The two criteria are however balanced, so that one could accept a relatively poor classifier if it is (currently) highly unpredictable.

Depending on the structure of the joint distribution of X and Y, these conditional entropies may not be easy to compute. A possible simplification is to make the approximation of neglecting the error rates of X_q at the selection stage, therefore replacing X_q by Y_q. Such an approximation leads to a simpler definition of q_k, namely

q_k = argmax_{q ∈ Q \ {q_1,...,q_{k-1}}} H(Y_q | E_{k-1}).   (7)

Notice that in the above the X and Y are not assumed to coincide in the conditioning event E_{k-1} (which depends on the X variables), so that the accuracy of the classifiers is still accounted for when evaluating the implications of current evidence. So here again, one prefers asking questions whose (true) answers are unpredictable. For example, one would not ask "Is it an urban scene?" after already having received a positive response to "Is there a skyscraper?", nor would one ask if there is an object instance from category c in patch "A" if we already know it is highly likely that there is an object instance from category c in patch "B", a subset of "A". Removing previous questions from the search is important with this approximation, since the mutual information in (6) vanishes in that case, but not necessarily the conditional entropy in (7).

Returning to the general situation, (6) can be simplified if one makes two independence assumptions:

1. The classifiers are conditionally independent given Y_Q;
2. The classifier X_q is conditionally independent of Y_{Q\q} given Y_q, i.e., the distribution of X_q depends on Y_Q only through Y_q.

Clearly H(X_q | Y_Q, E_{k-1}) = 0 if query q belongs to the history, so assume q ∉ {q_1, ..., q_{k-1}}. In what follows, let y = (y_q, q ∈ Q), where y_q represents a possible value of Y_q. Then, under assumptions 1 and 2, and using the fact that E_{k-1} only depends on the realizations of X, we have:

H(X_q | Y_Q, E_{k-1})
  = Σ_y H(X_q | Y_Q = y, E_{k-1}) P(Y_Q = y | E_{k-1})
  = Σ_y H(X_q | Y_Q = y) P(Y_Q = y | E_{k-1})
  = Σ_y H(X_q | Y_q = y_q) P(Y_Q = y | E_{k-1})
  = Σ_{y_q} H(X_q | Y_q = y_q) P(Y_q = y_q | E_{k-1}).   (8)

The entropy H(X_q | Y_q = y_q) can be computed from the data model, and the mixture weights P(Y_q = y_q | E_{k-1}) can be estimated from Monte Carlo simulations (see section 7). Similarly, the first term in (6), namely H(X_q | E_{k-1}), can be expressed as the entropy of a mixture:

H(X_q | E_{k-1}) = − Σ_x P(X_q = x | E_{k-1}) log P(X_q = x | E_{k-1})   (9)

with

P(X_q = x | E_{k-1}) = Σ_y P(X_q = x | Y_Q = y, E_{k-1}) P(Y_Q = y | E_{k-1}).   (10)

Arguing as with the second term in (6), i.e., replacing P(X_q = x | Y_Q = y, E_{k-1}) by P(X_q = x | Y_q = y_q), the last expression is the entropy of the mixture distribution

Σ_{y_q} P(X_q = x | Y_q = y_q) P(Y_q = y_q | E_{k-1}),   (11)

where x is fixed. Consequently, given an explicit data model, the information pursuit strategy can be efficiently approximated by sampling from the posterior distribution.

As a final note, we remark that we have used the variables Y_Q to represent the unknown scene Z. Writing

H(Z | E_{k-1}) = H(Z | Y_Q, E_{k-1}) + H(Y_Q | E_{k-1}),   (12)

we see that the residual uncertainty on Z given the current evidence will only slightly differ from the residual uncertainty of Y_Q as soon as the residual uncertainty of Z given Y_Q is small, which is a reasonable assumption when the number of annobits is large enough.

We now pass to a more specific description of the variables X, Y, Z and their distributions. In particular, the next section provides our driving principles for the choice of the annobits. We will then discuss the related classifiers, followed by the construction of the prior and data models, their training, and the associated sampling algorithms.

4 Annobits

4.1 General Principles

The choice of the functions f_q that define the annobits, Y_q = f_q(Z, W), q ∈ Q, naturally depends on the specific application. The annobits we have in mind for scene interpretation, and have used in previous related work on a visual Turing test (Geman et al. 2015), fall mainly into three categories:

– Scene context annobits: These indicate full scene labels, such as "indoor", "outdoor" or "street"; since our application is focused entirely on dining room table settings, we do not illustrate these.
– Part-of descriptors: These indicate whether or not one image region is a subset of another, e.g., whether an image patch is part of a table.
– Existence annobits: These relate to the presence or absence of object instances with certain properties or attributes. The most numerous set of annobits in our system ask whether or not instances of a given object category are visible inside a specified region.

Fig. 1 Some selected cells from different levels of the annocell hierarchy. Rectangles with dashed lines are the nearest-neighbor patches to the rectangles with solid lines of the same color.

Functions of these elementary descriptors can also be of interest. For example, we will rely heavily on annobits providing a list of all object categories visible in a given image region, as described in section 4.3.

4.2 Annocell Hierarchy

Recall from section 3.1 that a scene description Z consists of the object categories and 3D poses of visible instances from a pre-determined family of object categories. Here, motivated by our application to dining room table scenes where objects lie in the table plane, we use a 2D representation of the object pose, which can be put in one-to-one correspondence with its 3D pose via the homography relating the image plane and the table plane (see section 5.2 for details). More specifically, an object instance is a triple (C, L, D), where C ∈ C denotes the object category in a set of predefined categories C, L ∈ L denotes the location of its center in the image domain L, and D > 0 denotes its size in the image (e.g., diameter). The apparent 2D pose space is therefore L × (0, +∞). More refined poses could obviously be considered.

To define the queries, we divide the apparent pose space into cells. Specifically, we consider a finite, distinguished subset of sub-windows, A, and a subset of size intervals, M, and index the queries q ∈ Q by the triplet q = (C, A, M), where C ∈ C, A ∈ A, and M ∈ M. For every category C ∈ C, sub-window A ∈ A and size interval M ∈ M, we let Y_{C,A,M} = 1 if an instance of category C with size in M is visible in A, and Y_{C,A,M} = 0 otherwise. If M = (0, +∞), we simply write Y_{C,A}. We refer to A ∈ A as an "annocell". Specifically, assuming L = [0,1]² (by padding and normalizing), A consists of square patches of four sizes, 2^{−l} for l ∈ {0, 1, 2, 3}. The patches at each "level" overlap: for each level, the row and column shift ratio is 25%, i.e., there is 75% overlap between nearest windows. This leads to 1, 25, 169, and 841 patches for levels 0, 1, 2, and 3 respectively, for a total of |A| = 1036 patches. Figure 1 shows some of these regions selected from the four levels of the hierarchy.

Using a hierarchical annocell structure has the advantage of allowing for coarse-to-fine exploration of the pose space. Note also that, by construction, annocells at low resolution are unions of certain high-resolution ones. This implies that the value of the annobits at low resolution can in turn be derived as maximums of high-resolution annobits.

4.3 Extended Existence Annobits

Due to the nature of the classifiers we use in our application, we also introduce annobits that list the categories that have entirely visible instances in an annocell, i.e., the collection

Y^cat_A = (Y_{C,A}, C ∈ C).   (13)

In addition, we also use category-independent, size-related annobits: for each annocell A ∈ A and size interval M ∈ M, we define a binary annobit Y^sc_{A,M} which indicates whether or not the average size of the objects present in A belongs to M.

4.4 Classifiers for Annobits

The particular image-based predictors of the annobits we use in the table-setting application are described in full detail in section 8. Some examples include:

– Variables X^cat_A, A ∈ A, which provide a vector of weights on C for predicting Y^cat_A.
– Variables X^sc_A, A ∈ A, which provide a probability vector on M for predicting (Y^sc_{A,M}, M ∈ M).

Additional variables X^t_A, A ∈ A′ (where A′ is a subset of A) will also be introduced. They are designed to predict the information units Y^t_A = 1 if more than half of A overlaps the table. Observe that the classifier X_q assigned to Y_q does not necessarily assume the same value as Y_q. However, this is

Fig. 2 Table-fitting mesh.
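Since each classifier X_q is a noisy version of its annobit Y_q, the greedy selection rule of section 3.2 can be sketched end to end on a toy model. Everything below is an illustrative assumption rather than the paper's actual models: three binary annobits, a hand-made prior favoring agreement between the first two, and a symmetric 20% error channel playing the role of the data model. The two conditional-independence assumptions of section 3.2 are built in, so the conditional mutual information reduces to the mixture formulas (8)-(11), which need only the posterior marginals P(Y_q = y_q | E).

```python
import itertools
import math

# Toy instance of the Information Pursuit selection rule. Three binary
# queries, a hand-made Gibbs-style prior, and a symmetric noisy channel
# as the data model -- all illustrative assumptions, not the paper's
# fitted prior or CNN-based data model.
QUERIES = [0, 1, 2]
EPS = 0.2  # assumed classifier error rate: P(X_q != Y_q | Y_q)

CONFIGS = list(itertools.product([0, 1], repeat=len(QUERIES)))

def prior(y):
    """Unnormalized toy prior favoring agreement between annobits 0 and 1."""
    return 1.0 if y[0] == y[1] else 0.3

def data_model(x, yq):
    """P(X_q = x | Y_q = yq): symmetric noisy channel (assumption 2 of 3.2)."""
    return 1.0 - EPS if x == yq else EPS

def posterior(evidence):
    """P(Y_Q = y | E) for evidence E = [(q, x), ...], using conditional
    independence of the classifiers given Y (assumption 1 of 3.2)."""
    weights = {}
    for y in CONFIGS:
        w = prior(y)
        for q, x in evidence:
            w *= data_model(x, y[q])
        weights[y] = w
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_mutual_info(q, evidence):
    """I(X_q, Y_Q | E) = H(X_q | E) - H(X_q | Y_Q, E) via eqs (8)-(11)."""
    post = posterior(evidence)
    # Mixture weights P(Y_q = yq | E).
    w = [sum(p for y, p in post.items() if y[q] == yq) for yq in (0, 1)]
    # H(X_q | E): entropy of the mixture distribution, eqs (9)-(11).
    px = [sum(data_model(x, yq) * w[yq] for yq in (0, 1)) for x in (0, 1)]
    h_mixture = entropy(px)
    # H(X_q | Y_Q, E) = sum_yq H(X_q | Y_q = yq) P(Y_q = yq | E), eq (8).
    h_given_y = sum(w[yq] * entropy([data_model(x, yq) for x in (0, 1)])
                    for yq in (0, 1))
    return h_mixture - h_given_y

def next_query(evidence, asked):
    """One IP step: greedily pick the most informative unasked query."""
    candidates = [q for q in QUERIES if q not in asked]
    return max(candidates, key=lambda q: conditional_mutual_info(q, evidence))
```

After observing, say, X_0 = 1, the posterior marginal of Y_1 moves toward 1 because of the prior coupling, so Y_1 becomes more predictable and query 2, whose answer is still maximally uncertain, wins the selection, which matches the principle that one prefers questions whose answers are unpredictable given the evidence.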
not a problem since we are only interested in the conditional distribution of X given Y.

5 Prior Model

Following section 3, the joint distribution of the annobits (Y_q, q ∈ Q) is derived from a prior model on the 3D scene description, Z, and on the camera parameters, W. We assume these variables to be independent and model them separately.

5.1 Scene Model P(Z | S)

Motivated by our application to dining room table scenes, we assume a fixed dominant plane in the 3D model, and choose a coordinate system Oxyz in R³ such that the xy-plane coincides with this dominant plane. The scene Z is represented as a set of object instances, assumed to be sitting on a bounded region of the dominant plane, in our case a centered, rectangular table S characterized by its length and width. Recall from section 4.2 that each object instance i is represented by a category C_i ∈ C, a location L_i and a size D_i in the image. Here, we assume that objects from a given category have a fixed size, so that Z = {Z_i} with Z_i = (C_i, L_i). The distribution of Z will be defined conditional to S, since, for example, the size of S will directly impact the number of objects that it can support. More generally, the table can be replaced by some other variable S representing more complex properties of the global scene geometry. For convenience we sometimes drop S from our notation. However, most of the model components introduced below depend on S, and the proposed model is to be understood conditional to S.

We partition the reference plane into small cells (5 cm × 5 cm in the table-setting case) and use binary variables to indicate the presence of instances of object categories centered in each cell. In other words, we discretize the family (C_i, L_i) into a binary random field that we will still denote by Z. Letting J denote the set of cells, a configuration can therefore be represented as the binary vector z = (z_{j,c}, j ∈ J, c ∈ C), where z_{j,c} = 1 if and only if an object of category c is centered in the cell j.

The configuration z is obviously a discrete representation of the scene layout restricted to object categories C and locations L. Letting Ω denote the space of all such configurations, we will use a Gibbs distribution on Ω associated with a family of feature functions ϕ = (ϕ_i, i = 1, ..., n), with ϕ_i : Ω → {0,1}, and scalar parameters λ = (λ_i, i = 1, ..., n). The Gibbs distribution then has the following form:

p(z) = (1/κ(λ)) exp(λ · ϕ(z)),   (14)

where κ(λ) is the normalizing factor (partition function) ensuring that the probabilities sum to one. Figure 2 shows a table and its fitted mesh, where each cell is a 5 cm × 5 cm square.

We use the following features:

– Existence features, which indicate whether or not an instance from a given category is centered anywhere in a given set of cells, therefore taking the form

ϕ_{J,c}(z) = max(z_{j,c}, j ∈ J)   (15)

with J ⊂ J. We consider sets J at three different granularity levels, illustrated in Figure 3. At the fine level, J = {j} is a singleton, so that ϕ_{J,c}(z) = z_{j,c}. We also consider middle-level sets (3×3 arrays of fine cells) and coarse-level sets (6×6 arrays of fine cells) that cover the reference plane without intersection.

– Conjunction features, which are products of two middle-level existence features (of the same or different categories), and therefore signal their co-occurrence:

ϕ_{J_1,c_1,J_2,c_2}(z) = ϕ_{J_1,c_1}(z) ϕ_{J_2,c_2}(z).   (16)

To limit model complexity, only pairs J_1, J_2 whose centers are less than a threshold away are considered, where the threshold can depend on the pair c_1, c_2.

Fig. 3 Domain of various types of feature functions.

Invariance and symmetry assumptions about the 3D scene are then encoded as equality constraints among the model parameters, thereby reducing model complexity. Grouping binary features ϕ_i with identical parameters λ_i is then equivalent to considering a new set of features that count the number of layout configurations satisfying some conditions on the locations and categories. For table settings, it is natural to assume invariance by rotation around the center of the table. Hence we assume that existence features whose domain J is of the same size and located at the same distance from the closest table edge all have the same weights (λ's), and hence the probability only depends on the number of such instances.

We group conjunction feature functions based on the distance of the first patch to the edge of the table, and the relative position of the second patch (left, right, front, or back) with respect to the first patch.

Remark 1: The model can be generalized to include pose attributes other than location, e.g., orientation, size and height. If Θ denotes the space of poses, then one can extend the state space for z_{j,c} to {0,1} × Θ, interpreting z_{j,c} = (1,θ) as the presence of an object with category c and pose θ in cell j, and z_{j,c} = (0,θ) as the absence of any object with category c, θ being irrelevant. Features can then be extended to this state space to provide a joint distribution that includes pose. The simplest approach would be to only extend univariate features, so that object poses and other attributes are conditionally independent given their categories and locations (and the geometry variable S, since the model is always assumed conditional to it). Other attributes (color, style, etc.) can be incorporated in a similar way.

5.2 Camera Model P(W)

The second component of the prior model determines the probability distribution of the extrinsic and intrinsic camera parameters, such as the camera pose and focal length, respectively. The definition of these parameters is fairly standard in computer vision (see, e.g., Ma et al. (2003)), but the definition of generative models for these parameters is not. In what follows we summarize the typical definitions, and leave the details of the generative model to the Appendix.

Remember that we assumed a fixed coordinate system in 3D in which the xy-plane coincides with the dominant "horizontal" plane. Consider also a second camera coordinate system O′x′y′z′, such that the x′y′-plane is equal to the image plane. The extrinsic camera parameters are defined by the pose (R, T) of the camera coordinate system O′x′y′z′ relative to the fixed coordinate system Oxyz, where R is the camera rotation, which maps the unit axis vectors of Oxyz to the unit axis vectors of O′x′y′z′, and T = OO′ is the translation vector. We parametrize the rotation R by three angles ψ = (ψ_x, ψ_y, ψ_z) representing, respectively, counter-clockwise rotations of the camera's coordinate system about the x-axis, y-axis, and z-axis of the world coordinate system (see equation (29) for the conversion of unit vectors to angles). Observe that one can express the coordinates m = (x, y, z)⊤ of a 3D point in the world coordinate system as functions of its coordinates m′ = (x′, y′, z′)⊤ in the camera coordinate system in the form m = Rm′ + T. Since in our case 3D points lie in a plane N⊤m′ = d, where N is the normal to the plane (i.e., the table) measured in the camera coordinate system and d is the distance from the plane to the camera center, we further have m = Hm′, where H = (R + TN⊤/d) is the homography between the camera plane and the world plane.

The intrinsic camera parameters are defined by the coordinates of the focal point, (x_0, y_0, −f), where f > 0 is the focal length and (x_0, y_0) is the intersection of the principal axis of the camera with the image plane, as well as the pixel sizes in directions x′ and y′, denoted by γ_x and γ_y.

The complete set of camera parameters is therefore 11-dimensional and given by W = (f, γ_x, γ_y, x_0, y_0, ψ, T). Our generative model for W assumes that:

– Intrinsic camera parameters are independent from extrinsic camera parameters.
– Pixels are square, i.e., γ_x = γ_y, but intrinsic parameters are otherwise independent. The focal length f is uniformly distributed between 10 and 40 millimeters, x_0 (resp. y_0) is uniformly distributed between W_p/4 and 3W_p/4 (resp. H_p/4 and 3H_p/4), where W_p and H_p are the width and height of the image in pixels, and γ_x = γ_y is uniformly distributed between 1/W_p and 1.2/W_p.
unknowntransformationthatmapsittotheimage.Thiscan p p – TheverticalcomponentofT isindependentoftheother be done in several ways. For example, given four points in twoandthedistributionofthehorizontalcomponentsis theimagethataretheprojectionsofthecornersofasquare rotationinvariant.Specifically,lettingT =(T ,T ,T ), inthereferenceplane,onecanreconstruct,uptoascalefac- x y z weassumethat(T −0.3)/2.7followsaBetadistribu- tor,thehomographymappingthisplanetotheimage.Doing z tion so that T ∈ [0.3,3] (expressed in meters). Then, thiswithareasonableaccuracyisrelativelyeasyingeneral z (cid:113) lettingr = T2+T2 denotethedistancebetweenthe for a human annotator, and allows one to invert the outline x y of every flat object on the image that lies on the reference horizontalprojectionofT onthetableplaneandthecen- plane to its 3D shape, up to a scale ambiguity. This ambi- terofthetable,weassumethatr/4followsaBetadistri- guitycanberemovedbyknowingthetruedistancebetween bution.Weassumeindependenceofrandt andinvari- z twopointsinthereferenceplane,andtheirpositionsinthe ancebyrotationaroundtheverticalaxis,whichspecifies image. We used this level of annotation and representation thedistributionofT. forourtablesettings,basedonthefactthatallobjectsofin- – Thedistributionoftherotationanglesψ isdefinedcon- terestwereeitherhorizontal(e.g.,plates),orhadeasilyiden- ditionallytoT.Specifically,weassumethatthecamera tifiablehorizontalcomponents(e.g.,bottomsofbottles),and roughly points towards the center of the scene and the weassumedthatplateshadastandarddiameterof25cmto horizontal direction in the image plane is also horizon- removethescaleambiguity. talinthe3Dcoordinatesystem.Additionaldetailsofthe modelforp(ψ|T)areprovidedintheAppendix. Ascanbeseen,thelevelofannotationrequiredtotrain our prior model is quite high. 
While we have been able to produce rich annotations for 3,000 images of dining room 5.3 SceneGeometryModelP(S)andGlobalModel tablesettings(seesection9),thisisinsufficienttotrainour model.Toaddressthisissue,inthenextsectionwepropose WeassumethatthescenegeometryS takesvalueinafinite a 3D scene generation model that can be use to generate a setof“templategeometries”thatcoarselycoverallpossible large number of annotations for as many synthetic images situations.Notethatthesetemplatesaredefineduptotrans- as needed. Given the annotations of both synthetic images lation, since we can always assume that the 3D reference (section 6) as well as real images (section 9), the param- frameisplacedinagivenpositionrelativetothegeometry. eters of our prior model are learned using an accelerated For table settings, where the geometry represents the table versionoftherobuststochasticapproximation(Nemirovski itself,ourtemplatesweresimplysquaretableswithsizedis- etal.2009)tomatchempiricalstatisticscalculatedbasedon tributed according to a shifted and scaled Beta distribution top-downsamplesfromthescenegenerationmodel(seeJa- rangingfrom0.5to3meters.Thisroughapproximationwas hangiri(2016)fordetails). sufficientforourpurposes,eventhoughtablesinrealscenes areobviouslymuchmorevariableinshapeandsize. Finally,thejointpriordistributionp(z,s,w) = P(Z = z,S =s,W =w)ofallthevariablesisdefinedby: p(z,s,w)=p(z|s)p(s)p(w). (17) 6 SceneGenerationModel 5.4 LearningthePriorModel Inthissectionweproposea3Dscenegenerationmodelthat ThemodelsforP(S)andP(W)aresimpleenoughthatwe can be used to generate a large number of annotations to specifiedtheirmodelparametersmanually,asdescribedbe- train the prior model described in the section 5. The pro- fore. Therefore, the fundamental challenge is to learn the posedmodelmimicsanaturalsequenceofstepsincompos- priormodelonsceneinterpretationsP(Z|S).Forthispur- ing a scene. 
First, create spontaneous instances by placing pose, we assume that a training set of annotated images is some objects randomly in the scene; the distribution of lo- available. The annotation for each image consists of a list cations depends on the scene geometry. Then, allow each of object instances, each one labeled by its category (and of these instances to trigger the placement of ancillary ob- possiblyotherattributes)andapparent2Dposerepresented jects,whosecategoriesandattributesaresampledcondition- by an ellipse in the image plane. We also assume that suf- ally,creatinggroupsofcontextuallyrelatedobjects.Thisre- ficient information is provided to propagate the image an- cursiveprocessterminateswhennochildrenarecreated,or notation to a scene annotation in 3D coordinates; this will whenthenumberofiterationsreachesanupper-bound. 10 EhsanJahangirietal. plate bottle 5 17 15 4 3 16 14 13 9 utensil glass 20 6 10 7 2 18 Fig.4 Anexamplemastergraph. 1 12 11 8 6.1 ModelDescriptionUsingaGenerativeAttributed 19 Graph plate glass utensil Toformallydefinethisprocess,wewillusethenotationn= (n ,c ∈ C)torepresentafamilyofintegercountsn ∈ N indcexed by categories, so that n ∈ N|C|. We will aclso let 1 2 3 4 5 6 (cid:80) |n|= c∈Cnc. 7 8 9 10 11 12 13 14 15 16 17 We will assume a probability distribution p(0) on N|C|, andafamilyofsuchdistributionsp(c),c ∈ C.Thesedistri- 18 19 20 butions(whicharedefinedconditionallytoS =s)areused Fig.5 Atable-settingscene(top)anditscorrespondingskeletongraph to decide the number of objects that will be placed in the (bottom) where the categories (plate, bottle, glass, and utensil) are sceneateachstep.Morespecifically: color-codedinthegraph.RootnodesV0initializethegenerativepro- cess; here there are six. The terminal nodes for this instance are 1. p(0)(·|s)istheconditionaljointdistributionofthenum- VT = {6,8,9,10,11,14,15,16,17,18,19,20}.Accordingtothe basegraphn(0) =4,n(0) =0,n(0) =0andn(0) =2. 
berofobjectinstancesfromeachcategorythatareplaced plate bottle glass utensil initiallyonthescene. 2. Foreachcategoryc ∈ C,p(c)(·|s)isthejointdistribu- gory attribute) to obtain a complete scene description. The tionofthenumbersofnewobjectinstancesthataretrig- probabilitydistributionofG is 0 geredbytheadditionofanobjectinstancefromcategory (cid:89) c.Thesedistributionscanbethoughtofasthebasisdis- p(G |s)= p(c(v))(n(v)|s), (18) 0 tributions in a multi-type branching process (see Mode v∈V\VT (1971)). where V isthe setofterminal nodesand n(v) arethe cat- T Thecomplexityoftheprocessiscontrolledbyamastergraph egory counts of the children of v (graphs being identified that restricts the subset of categories that can be created at uptocategory-invariantisomorphisms).Anexampleofsuch eachstep.Moreformally,thisdirectedgraphhasverticesin graphisprovidedinFigure5. {0}∪C andissuchthatp(v)issupportedbycategoriesthat To complete the description, we need to associate at- arechildrenofthenodev ∈{0}∪C.Adjoining0tothenode tributes to objects, the most important of them being their labelsavoidstreatingp(0)asaspecialcaseinthederivations posesinthe3Dworld,onwhichwefocusnow.IntheMRF below. The master graph we used on table settings is pro- designedforourexperiments,theonlyrelevantinformation vided in Figure 4, where we regard “plate” and “bottle” as aboutposewasthelocationonthetable,a2Dparameter.It thechildrenofcategory0.Notethatsinceweallowsponta- ishoweverpossibletodesignatop-downgenerativemodel neousinstancesfromallcategorieseverycategoryisachild that includes richer information, using for example a 3D tocategory0. ellipsoid. 
Such representations involve a small number of parameters, denoted generically by θ: each vertex v in the skeleton graph is attributed parameters such as its pose, denoted by θ(v). When using ellipsoids, θ(v) involves eight free parameters (five for the shape of the ellipsoid, which is a positive definite symmetric matrix, and three for its center). Fewer parameters would be needed for flat objects (represented by a 2D ellipse), or vertical ones, or objects with rotational symmetry. In any case, it is obvious that the distribution of an object pose depends heavily on its category. In our model, contextual information is important: when placing an object relative to a parent, the pose also depends
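As a concrete illustration of the Gibbs scene prior of section 5.1, the following minimal Python sketch evaluates eq. (14) on a toy 2×2 grid of cells with a single category. The feature set and the weights λ are illustrative stand-ins (the paper learns them from data), and the partition function κ(λ) is computed by brute force, which is only feasible here because the toy configuration space Ω has 16 elements.

```python
import itertools
import math

# Toy instance of the Gibbs prior of eq. (14): a 2x2 grid of cells and a
# single object category, so a configuration z is four bits.
CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def existence(z, cell_set):
    # Eq. (15): 1 iff an instance is centered anywhere in the cell set.
    return max(z[c] for c in cell_set)

def features(z):
    fine = [existence(z, [c]) for c in CELLS]                  # fine level
    coarse = existence(z, CELLS)                               # coarse level
    conj = existence(z, CELLS[:2]) * existence(z, CELLS[2:])   # eq. (16)
    return fine + [coarse, conj]

LAM = [0.2, 0.2, 0.2, 0.2, -0.5, 1.0]  # illustrative weights, one per feature

def unnormalized(z):
    return math.exp(sum(l * f for l, f in zip(LAM, features(z))))

# Partition function kappa(lambda): brute-force sum over all 2^4 configurations.
OMEGA = [dict(zip(CELLS, bits)) for bits in itertools.product([0, 1], repeat=4)]
KAPPA = sum(unnormalized(z) for z in OMEGA)

def prob(z):
    return unnormalized(z) / KAPPA
```

In the real model the cells are 5cm × 5cm patches over the whole table and κ(λ) is intractable, which is why the paper resorts to stochastic approximation for learning rather than exact normalization.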
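The generative camera model of section 5.2 can be sketched as a sampler. The uniform ranges below are the ones stated in the text; the Beta(2, 2) shape parameters and the camera resolution are illustrative assumptions, and the rotation angles ψ are omitted since the paper defers p(ψ|T) to its Appendix.

```python
import math
import random

def sample_camera(Wp=640, Hp=480, rng=None):
    """One draw from the generative camera model of section 5.2 (sketch).
    Beta(2, 2) shapes are assumptions; psi given T is left to the Appendix."""
    rng = rng or random.Random()
    f = rng.uniform(10, 40)                  # focal length, millimeters
    gamma = rng.uniform(1 / Wp, 1.2 / Wp)    # square pixels: gamma_x = gamma_y
    x0 = rng.uniform(Wp / 4, 3 * Wp / 4)     # principal point, x
    y0 = rng.uniform(Hp / 4, 3 * Hp / 4)     # principal point, y
    Tz = 0.3 + 2.7 * rng.betavariate(2, 2)   # (Tz - 0.3)/2.7 ~ Beta, Tz in [0.3, 3] m
    r = 4 * rng.betavariate(2, 2)            # r/4 ~ Beta; horizontal distance to center
    phi = rng.uniform(0, 2 * math.pi)        # rotation invariance about vertical axis
    T = (r * math.cos(phi), r * math.sin(phi), Tz)
    return {"f": f, "gamma_x": gamma, "gamma_y": gamma, "x0": x0, "y0": y0, "T": T}
```

Drawing the horizontal angle uniformly and the radius from the scaled Beta is exactly what the stated rotation invariance of the horizontal components of T amounts to.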
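The plane-induced homography H = R + TN^T/d from section 5.2 is simple enough to verify numerically: for any camera-frame point on the plane N^T m' = d, the map m = Hm' must agree with the rigid transform m = Rm' + T. A minimal sketch with plain 3×3 lists:

```python
def homography(R, T, N, d):
    # H = R + T N^T / d, section 5.2: collapses m = R m' + T to m = H m'
    # for points on the plane N^T m' = d.
    return [[R[i][j] + T[i] * N[j] / d for j in range(3)] for i in range(3)]

def matvec(H, v):
    return [sum(H[i][j] * v[j] for j in range(3)) for i in range(3)]

# Sanity check with an identity rotation and the plane z' = d (N = (0,0,1)).
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
T = [0.5, -0.2, 1.5]
N, d = [0.0, 0.0, 1.0], 2.0
m_prime = [0.3, 0.4, 2.0]                   # satisfies N^T m' = d
m = matvec(homography(R, T, N, d), m_prime)  # -> [0.8, 0.2, 3.5], i.e. m' + T
```

The check works because H m' = R m' + T (N^T m')/d, and the factor (N^T m')/d equals 1 exactly on the plane.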
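The multi-type branching process of section 6.1 can be sketched as a breadth-first sampler of skeleton trees G_0. The master graph and the child-count distributions below are hypothetical stand-ins: the paper uses the learned, S-conditional distributions p^(c)(·|s) and the master graph of Figure 4, not these hard-coded values.

```python
import random
from collections import deque

# Hypothetical master graph in the spirit of Figure 4. Category "0" is the
# empty-scene root; edges list which categories a node may trigger.
MASTER_GRAPH = {
    "0": ["plate", "bottle", "glass", "utensil"],  # spontaneous instances
    "plate": ["glass", "utensil"],                 # ancillary objects
    "bottle": ["glass"],
    "glass": [],
    "utensil": [],
}

def sample_counts(cat, rng):
    # Stand-in for p^(c)(. | s): independent counts in {0, 1, 2} per allowed
    # child category; the paper learns these jointly, conditionally on S = s.
    return {c: rng.choices([0, 1, 2], weights=[0.6, 0.3, 0.1])[0]
            for c in MASTER_GRAPH[cat]}

def sample_skeleton_tree(rng, max_nodes=50):
    # Breadth-first expansion of the branching process: terminates when no
    # children are created or when the node budget is exhausted.
    tree = {0: ("0", [])}           # node id -> (category, list of child ids)
    queue, next_id = deque([0]), 1
    while queue:
        v = queue.popleft()
        for child_cat, n in sample_counts(tree[v][0], rng).items():
            for _ in range(n):
                if next_id >= max_nodes:
                    return tree
                tree[next_id] = (child_cat, [])
                tree[v][1].append(next_id)
                queue.append(next_id)
                next_id += 1
    return tree
```

Evaluating eq. (18) on a sampled tree is then just the product, over non-terminal nodes, of the probability their count distribution assigns to the observed child counts.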
