Complexity of Representation and Inference in Compositional Models with Part Sharing

Alan L. Yuille
Depts. of Statistics, Computer Science & Psychology
University of California, Los Angeles
[email protected]

Roozbeh Mottaghi
Department of Computer Science
University of California, Los Angeles
[email protected]

Abstract

This paper describes serial and parallel compositional models of multiple objects with part sharing. Objects are built by part-subpart compositions and expressed in terms of a hierarchical dictionary of object parts. These parts are represented on lattices of decreasing sizes which yield an executive summary description. We describe inference and learning algorithms for these models. We analyze the complexity of this model in terms of computation time (for serial computers) and numbers of nodes (e.g., "neurons") for parallel computers. In particular, we compute the complexity gains by part sharing and its dependence on how the dictionary scales with the level of the hierarchy. We explore three regimes of scaling behavior where the dictionary size (i) increases exponentially with the level, (ii) is determined by an unsupervised compositional learning algorithm applied to real data, (iii) decreases exponentially with scale. This analysis shows that in some regimes the use of shared parts enables algorithms which can perform inference in time linear in the number of levels for an exponential number of objects. In other regimes part sharing has little advantage for serial computers but can give linear processing on parallel computers.

1 Introduction

A fundamental problem of vision is how to deal with the enormous complexity of images and visual scenes.^1 The total number of possible images is almost infinitely large [8]. The number of objects is also huge and has been estimated at around 30,000 [2]. How can a biological, or artificial, vision system deal with this complexity? For example, considering the enormous input space of images and output space of objects, how can humans interpret images in less than 150 msec [16]?

There are three main issues involved. Firstly, how can a visual system be designed so that it can efficiently represent large classes of objects, including their parts and subparts? Secondly, how can the visual system be designed so that it can rapidly infer which object, or objects, are present in an input image and the positions of their subparts? And, thirdly, how can this representation be learnt in an unsupervised, or weakly supervised, fashion? In short, what visual architectures enable us to address these three issues?

Many considerations suggest that visual architectures should be hierarchical. The structure of mammalian visual systems is hierarchical, with the lower levels (e.g., in areas V1 and V2) tuned to small image features while the higher levels (i.e., in area IT) are tuned to objects.^2 Moreover, as appreciated by pioneers such as Fukushima [5], hierarchical architectures lend themselves naturally to efficient representations of objects in terms of parts and subparts which can be shared between many objects. Hierarchical architectures also lead to efficient learning algorithms, as illustrated by deep belief learning and others [7]. There are many varieties of hierarchical models which differ in details of their representations and their learning and inference algorithms [14, 7, 15, 1, 13, 18, 10, 3, 9]. But, to the best of our knowledge, there has been no detailed study of their complexity properties.

^1 Similar complexity issues will arise for other perceptual and cognitive modalities.
^2 But just because mammalian visual systems are hierarchical does not necessarily imply that this is the best design for computer vision systems.
This paper provides a mathematical analysis of compositional models [6], which are a subclass of the hierarchical models. The key idea of compositionality is to explicitly represent objects by recursive composition from parts and subparts. This gives rise to natural learning and inference algorithms which proceed from sub-parts to parts to objects (e.g., inference is efficient because a leg detector can be used for detecting the legs of cows, horses, and yaks). The explicitness of the object representations helps quantify the efficiency of part sharing and makes mathematical analysis possible. The compositional models we study are based on the work of L. Zhu and his collaborators [19, 20], but we make several technical modifications, including a parallel re-formulation of the models. We note that in previous papers [19, 20] the representations of the compositional models were learnt in an unsupervised manner, which relates to the memorization algorithms of Valiant [17]. This paper does not address learning but instead explores the consequences of the representations which were learnt.

Our analysis assumes that objects are represented by hierarchical graphical probability models^3 which are composed from more elementary models by part-subpart compositions. An object (a graphical model with H levels) is defined as a composition of r parts, which are graphical models with H-1 levels. These parts are defined recursively in terms of subparts which are represented by graphical models of increasingly lower levels. It is convenient to specify these compositional models in terms of a set of dictionaries {M_h : h = 1, ..., H}, where the level-h parts in dictionary M_h are composed in terms of level-(h-1) parts in dictionary M_{h-1}. The highest level dictionary M_H represents the set of all objects. The lowest level dictionary M_1 represents the elementary features that can be measured from the input image. Part-subpart composition enables us to construct a very large number of objects by different compositions of elements from the lowest-level dictionary. It enables us to perform part sharing during learning and inference, which can lead to enormous reductions in complexity, as our mathematical analysis will show.

There are three factors which enable computational efficiency. The first is part sharing, as described above, which means that we only need to perform inference on the dictionary elements. The second is the executive-summary principle. This principle allows us to represent the state of a part coarsely because we are also representing the states of its subparts (e.g., an executive will only want to know that "there is a horse in the field" and will not care about the precise positions of its legs). For example, consider a letter T which is composed of a horizontal and a vertical bar. If the positions of these two bars are specified precisely, then we can specify the position of the letter T more crudely (sufficient for it to be "bound" to the two bars). This relates to Lee and Mumford's high-resolution buffer hypothesis [11] and possibly to the experimental finding that neurons higher up the visual pathway are tuned to increasingly complex image features but are decreasingly sensitive to spatial position. The third factor is parallelism, which arises because the part dictionaries can be implemented in parallel, essentially having a set of receptive fields for each dictionary element. This enables extremely rapid inference at the cost of a larger, but parallel, graphical model.

Section (2) introduces the key ideas of the compositional models. Section (3) describes the inference algorithms for serial and parallel implementations. Section (4) performs a complexity analysis and shows potential exponential gains by using compositional models.

^3 These graphical models contain closed loops but with restricted maximal clique size.
2 The Compositional Models

Compositional models are based on the idea that objects are built by compositions of parts which, in turn, are compositions of more elementary parts. These are built by part-subpart compositions.

Figure 1: (a) Compositional part-subpart models for T and L are constructed from the same elementary components τ_1, τ_2, a horizontal and a vertical bar, using different spatial relations λ = (µ, σ), which impose locality. The state x of the parent node gives the summary position of the object, the executive summary, while the positions x_1, x_2 of the components give details about its components. (b) The hierarchical lattices. The size of the lattices decreases with scale by a factor q, which helps enforce executive summary and prevent having multiple hypotheses which overlap too much. q = 1/4 in this figure.

2.1 Compositional Part-Subparts

We formulate part-subpart compositions by a probabilistic graphical model which specifies how a part is composed of its subparts. A parent node ν of the graph represents the part by its type τ_ν and a state variable x_ν (e.g., x_ν could indicate the position of the part). The r child nodes Ch(ν) = (ν_1, ..., ν_r) represent the subparts by their types τ_{ν_1}, ..., τ_{ν_r} and state variables \vec{x}_{Ch(ν)} = (x_{ν_1}, ..., x_{ν_r}).^4 The type of the parent node is specified by τ_ν = (τ_{ν_1}, ..., τ_{ν_r}, λ_ν). Here (τ_{ν_1}, ..., τ_{ν_r}) are the types of the child nodes, and λ_ν specifies a distribution over the states of the subparts (e.g., over their relative spatial positions). Hence the type τ_ν of the parent specifies the part-subpart compositional model.

The probability distribution for the part-subpart model relates the states of the part and the subparts by:

P(\vec{x}_{Ch(ν)} | x_ν; τ_ν) = δ(x_ν − f(\vec{x}_{Ch(ν)})) h(\vec{x}_{Ch(ν)}; λ_ν).   (1)

Here f(.) is a deterministic function, so the state of the parent node is determined uniquely by the states of the child nodes. The function h(.) specifies a distribution on the relative states of the child nodes. The distribution P(\vec{x}_{Ch(ν)} | x_ν; τ_ν) obeys a locality principle, which means that P(\vec{x}_{Ch(ν)} | x_ν; τ_ν) = 0 unless |x_{ν_i} − x_ν| is smaller than a threshold for all i = 1, ..., r. This requirement captures the intuition that subparts of a part are typically close together.

The state variable x_ν of the parent node provides an executive summary description of the part. Hence it is restricted to take a smaller set of values than the state variables x_{ν_i} of the subparts. Intuitively, the state x_ν of the parent offers summary information (e.g., there is a cow in the right side of a field) while the child states \vec{x}_{Ch(ν)} offer more detailed information (e.g., the positions of the parts of the cow). In general, information about the object is represented in a distributed manner, with coarse information at the upper levels of the hierarchy and more precise information at lower levels.

We give examples of part-subpart compositions in figure (1(a)). The compositions represent the letters T and L, which are the types of the parent nodes. The types of the child nodes are horizontal and vertical bars, indicated by τ_1 = H, τ_2 = V. The child state variables x_1, x_2 indicate the image positions of the horizontal and vertical bars. The state variable x of the parent node gives a summary description of the positions of the letters T and L. The compositional models for the letters T and L differ by their λ parameter, which specifies the relative positions of the horizontal and vertical bars. In this example, we choose h(.; λ) to be a Gaussian distribution, so λ = (µ, σ) where µ is the mean relative position between the bars and σ is the covariance. We set f(x_1, x_2) = (1/2)(x_1 + x_2), so the state of the parent node specifies the average position of the child nodes (i.e., the positions of the two bars). Hence the two compositional models for the T and the L have types τ_T = (H, V, λ_T) and τ_L = (H, V, λ_L).

^4 In this paper, we assume a fixed value r for all part-subpart compositions.
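To make equation (1) concrete, the following is a minimal Python/NumPy sketch of the T/L part-subpart composition described above. The specific numbers for µ and σ, and the choice of a Gaussian on the relative offset of the two bars, are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def gaussian_pdf(v, mu, cov):
    """Density of a multivariate Gaussian at v (plays the role of h(.; lambda))."""
    v, mu = np.atleast_1d(v).astype(float), np.atleast_1d(mu).astype(float)
    d = len(mu)
    diff = v - mu
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    return np.exp(-0.5 * diff @ inv @ diff) / np.sqrt((2 * np.pi) ** d * det)

def part_subpart_prob(x_children, x_parent, lam):
    """Equation (1): delta(x_parent - f(x_children)) * h(x_children; lambda).

    f is the average of the child positions (the executive-summary position);
    h is a Gaussian on the relative position of child 2 with respect to child 1.
    """
    x1, x2 = [np.asarray(x, dtype=float) for x in x_children]
    f = 0.5 * (x1 + x2)                        # executive-summary position of the parent
    if not np.allclose(f, x_parent):           # the delta term of equation (1)
        return 0.0
    mu, cov = lam
    return gaussian_pdf(x2 - x1, mu, cov)      # the h(.; lambda) term of equation (1)

# Illustrative (assumed) spatial relations lambda = (mu, sigma) for the letters T and L:
lam_T = (np.array([0.0, -2.0]), np.eye(2) * 0.5)   # vertical bar directly below the horizontal bar
lam_L = (np.array([1.5, -1.5]), np.eye(2) * 0.5)   # bars meeting at a corner

x_h, x_v = np.array([4.0, 6.0]), np.array([4.0, 4.0])   # child (bar) positions
x_parent = 0.5 * (x_h + x_v)                             # summary position of the letter
print(part_subpart_prob((x_h, x_v), x_parent, lam_T))
print(part_subpart_prob((x_h, x_v), x_parent, lam_L))
```

The same child positions receive different probabilities under the two types because only λ differs, which is exactly how the T and L models are distinguished. The locality threshold is omitted here for brevity.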
Figure 2: Left Panel: Two part-subpart models. Center Panel: Combining two part-subpart models by composition to make a higher level model. Right Panel: Some examples of the shapes that can be generated by different parameter settings λ of the distribution.

2.2 Models of Object Categories

An object category can be modeled by repeated part-subpart compositions. This is illustrated in figure (2), where we combine T's and L's with other parts to form more complex objects. More generally, we can combine part-subpart compositions into bigger structures by treating the parts as subparts of higher order parts.

More formally, an object category of type τ_H is represented by a probability distribution defined over a graph V. This graph has a hierarchical structure with levels h ∈ {0, ..., H}, where V = ∪_{h=0}^{H} V_h. Each object has a single root node at level H (i.e., V_H contains a single node). Any node ν ∈ V_h (for h > 0) has r children nodes Ch(ν) in V_{h−1} indexed by (ν_1, ..., ν_r). Hence there are r^{H−h} nodes at level h (i.e., |V_h| = r^{H−h}).

At each node ν there is a state variable x_ν, which indicates spatial position, and a type τ_ν. The type τ_H of the root node indicates the object category and also specifies the types of its parts.

The position variables x_ν take values in a set of lattices {D_h : h = 0, ..., H}, so that a level-h node ν ∈ V_h takes position x_ν ∈ D_h. The leaf nodes V_0 of the graph take values on the image lattice D_0. The lattices are evenly spaced and the number of lattice points decreases by a factor of q < 1 for each level, so |D_h| = q^h |D_0|, see figure (1(b)). This decrease in the number of lattice points imposes the executive summary principle. The lattice spacing is designed so that parts do not overlap. At higher levels of the hierarchy the parts cover larger regions of the image, and so the lattice spacing must be larger, and hence the number of lattice points smaller, to prevent overlapping.^5

The probability model for an object category of type τ_H is specified by products of part-subpart relations:

P(\vec{x} | τ_H) = ∏_{ν ∈ V/V_0} P(\vec{x}_{Ch(ν)} | x_ν; τ_ν) U(x_H).   (2)

Here U(x_H) is the uniform distribution.

2.3 Multiple Object Categories, Shared Parts, and Hierarchical Dictionaries

Now suppose we have a set of object categories τ_H ∈ M_H, each of which can be expressed by an equation such as equation (2). We assume that these objects share parts. To quantify the amount of part sharing we define a hierarchical dictionary {M_h : h = 0, ..., H}, where M_h is the dictionary of parts at level h. This gives an exhaustive set of the parts of this set of objects, at all levels h = 0, ..., H. The elements of the dictionary M_h are composed from elements of the dictionary M_{h−1} by part-subpart compositions.^6

This gives an alternative way to think of object models. The type variable τ_ν of a node at level h (i.e., in V_h) indexes an element of the dictionary M_h. Hence objects can be encoded in terms of the hierarchical dictionary. Moreover, we can create new objects by making new compositions from existing elements of the dictionaries.

^5 Previous work [20, 19] was not formulated on lattices and used non-maximal suppression to achieve the same effect.
^6 The unsupervised learning algorithm in [19] automatically generates this hierarchical dictionary.
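The following is a minimal sketch (an assumed toy example, not taken from the paper) of how such a hierarchical dictionary can be represented in code: each level-h entry stores the indices of its r = 2 children in the level-(h−1) dictionary plus a spatial relation λ, and an object category is simply an index into the top-level dictionary.

```python
from dataclasses import dataclass

@dataclass
class Part:
    """A dictionary element: a part defined by its child parts and a spatial relation."""
    children: tuple        # indices into the dictionary one level below (empty for level 0)
    lam: object = None     # the spatial relation lambda (e.g., mean offsets of the children)

# Level-0 dictionary: elementary features measurable from the image.
M0 = [Part(children=()), Part(children=())]          # 0: horizontal bar, 1: vertical bar

# Level-1 dictionary: parts composed of two level-0 elements (e.g., the letters T and L).
M1 = [Part(children=(0, 1), lam="lambda_T"),
      Part(children=(0, 1), lam="lambda_L")]

# Level-2 dictionary: object categories composed of two level-1 parts.
# Both objects contain the level-1 part 0 (the "T"); this sharing is the source of the
# computational savings analyzed in section 4.
M2 = [Part(children=(0, 1), lam="lambda_A"),
      Part(children=(0, 0), lam="lambda_B")]

dictionaries = [M0, M1, M2]   # {M_h : h = 0, ..., H} with H = 2

def expand(level, index):
    """Unroll an object/part into the tree of level-0 feature indices it is built from."""
    part = dictionaries[level][index]
    if not part.children:
        return index
    return [expand(level - 1, c) for c in part.children]

print(expand(2, 0))   # object A -> [[0, 1], [0, 1]]
print(expand(2, 1))   # object B -> [[0, 1], [0, 1]]  (same leaves, different lambdas)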
2.4 The Likelihood Function and the Generative Model

To specify a generative model for each object category we proceed as follows. The prior specifies a distribution over the positions and types of the leaf nodes of the object model. Then the likelihood function is specified in terms of the type at the leaf nodes (e.g., if the leaf node is a vertical bar, then there is a high probability that the image has a vertical edge at that position).

More formally, the prior P(\vec{x} | τ_H), see equation (2), specifies a distribution over a set of points L = {x_ν : ν ∈ V_0} (the leaf nodes of the graph) and specifies their types {τ_ν : ν ∈ V_0}. These points are required to lie on the image lattice (e.g., x_ν ∈ D_0). We denote this as {(x, τ(x)) : x ∈ L}, where τ(x) is specified in the natural manner (i.e., if x = x_ν then τ(x) = τ_ν). We specify distributions P(I(x) | τ(x)) for the probability of the image I(x) at x conditioned on the type of the leaf node. We specify a default probability P(I(x) | τ_0) at positions x where there is no leaf node of the object. This gives a likelihood function for the states \vec{x} = {x_ν : ν ∈ V} of the object model in terms of the image I = {I(x) : x ∈ D_0}:

P(I | \vec{x}) = ∏_{x ∈ L} P(I(x) | τ(x)) × ∏_{x ∈ D_0/L} P(I(x) | τ_0).   (3)

The likelihood and the prior, equations (3, 2), give a generative model for each object category. We can extend this in the natural manner to give generative models for two, or more, objects in the image, provided they do not overlap. Intuitively, this involves multiple sampling from the prior to determine the types of the lattice pixels, followed by sampling from P(I(x) | τ) at the leaf nodes to determine the image I. Similarly, we have a default background model for the entire image if no object is present:

P_B(I) = ∏_{x ∈ D_0} P(I(x) | τ_0).   (4)
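As a small illustration of equations (3) and (4) (a sketch under assumed toy distributions, not the paper's implementation), note that in the log-likelihood ratio between the object model and the background model the background terms cancel everywhere except at the leaf-node positions; this is the quantity the inference algorithm of section 3 accumulates.

```python
import math

# Assumed toy conditional distributions P(I(x) | type) over a binary "edge / no-edge" image.
P_given_type = {
    "H":  {1: 0.9, 0: 0.1},   # horizontal-bar leaf: an edge is likely
    "V":  {1: 0.9, 0: 0.1},   # vertical-bar leaf: an edge is likely
    "bg": {1: 0.2, 0: 0.8},   # background type tau_0
}

def log_likelihood_ratio(image, leaves):
    """log P(I | x) - log P_B(I) for a leaf assignment {position: type}.

    Only the leaf positions contribute, because the background factors
    cancel everywhere else (cf. equations (3), (4), and (5)).
    """
    total = 0.0
    for pos, leaf_type in leaves.items():
        pixel = image[pos]
        total += math.log(P_given_type[leaf_type][pixel]) - math.log(P_given_type["bg"][pixel])
    return total

image = {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 0}   # a tiny 2x2 "image"
leaves = {(0, 0): "H", (0, 1): "V"}                    # leaf nodes of a candidate object
print(log_likelihood_ratio(image, leaves))             # > 0 favors the object over background
```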
3 Inference by Dynamic Programming

The inference task is to determine which objects are present in the image and to specify their positions. This involves two subtasks: (i) state estimation, to determine the optimal states of a model and hence the positions of the objects and their parts, and (ii) model selection, to determine whether objects are present or not. As we will show, both tasks can be reduced to calculating and comparing log-likelihood ratios, which can be performed efficiently using dynamic programming methods.

We will first describe the simplest case, which consists of estimating the state variables of a single object model and using model selection to determine whether the object is present in the image and, if so, how many times. Next we show that we can perform inference and model selection for multiple objects efficiently by exploiting part sharing (using hierarchical dictionaries). Finally, we show how these inference tasks can be performed even more efficiently using a parallel implementation. We stress that we are performing exact inference and no approximations are made. We are simply exploiting part sharing so that computations required for performing inference for one object can be re-used when performing inference for other objects.

3.1 Inference Tasks: State Detection and Model Selection

We first describe a standard dynamic programming algorithm for finding the optimal state of a single object category model. Then we describe how the same computations can be used to perform model selection and to perform detection and state estimation if the object appears multiple times in the image (non-overlapping).

Consider performing inference for a single object category model defined by equations (2, 3). Calculating the MAP estimate of the state variables requires computing \vec{x}^* = arg max_{\vec{x}} {log P(I | \vec{x}) + log P(\vec{x}; τ_H)}. By subtracting the constant term log P_B(I) from the righthand side, we can re-express this as estimating:

\vec{x}^* = arg max_{\vec{x}} { Σ_{x ∈ L} log [P(I(x) | τ(x)) / P(I(x) | τ_0)] + Σ_{ν ∈ V/V_0} log P(\vec{x}_{Ch(ν)} | x_ν; τ_ν) + log U(x_H) }.   (5)

Here L denotes the positions of the leaf nodes of the graph, which must be determined during inference.

Figure 3: Left Panel: The feedforward pass propagates hypotheses up to the highest level, where the best state is selected. Center Panel: Feedback propagates information from the top node, disambiguating the middle level nodes. Right Panel: Feedback from the middle level nodes propagates back to the input layer to resolve ambiguities there. This algorithm rapidly estimates the top-level executive summary description in a rapid feed-forward pass. The top-down pass is required to allow high-level context to eliminate false hypotheses at the lower levels: "high-level tells low-level to stop gossiping".

We estimate \vec{x}^* by performing dynamic programming. This involves a bottom-up pass which recursively computes quantities φ(x_h, τ_h) = max_{\vec{x}/x_h} { log [P(I | \vec{x}) / P_B(I)] + log P(\vec{x}; τ_h) } by the formula:

φ(x_ν, τ_ν) = max_{\vec{x}_{Ch(ν)}} { Σ_{i=1}^{r} φ(x_{ν_i}, τ_{ν_i}) + log P(\vec{x}_{Ch(ν)} | x_ν, τ_ν) }.   (6)

We refer to φ(x_h, τ_h) as the local evidence for part τ_h with state x_h (after maximizing over the states of the lower parts of the graphical model). This local evidence is computed bottom-up. We call this the local evidence because it ignores the context evidence for the part which will be provided during top-down processing (i.e., evidence for other parts of the object, in consistent positions, will strengthen the evidence for this part).

The bottom-up pass outputs the global evidence φ(x_H, τ_H) for object category τ_H at position x_H. We can detect the most probable state of the object by computing x_H^* = arg max_{x_H} φ(x_H, τ_H). Then we can perform the top-down pass of dynamic programming to estimate the most probable states \vec{x}^* of the entire model by recursively performing:

\vec{x}^*_{Ch(ν)} = arg max_{\vec{x}_{Ch(ν)}} { Σ_{i=1}^{r} φ(x_{ν_i}, τ_{ν_i}) + log P(\vec{x}_{Ch(ν)} | x^*_ν, τ_ν) }.   (7)

This outputs the most probable state of the object in the image. Note that the bottom-up process first estimates the optimal "executive summary" description of the object (x_H^*) and only later determines the optimal estimates of the lower-level states of the object in the top-down pass. Hence, the algorithm is faster at detecting that there is a cow in the right side of the field (estimated in the bottom-up pass) and is slower at determining the positions of the feet of the cow (estimated in the top-down pass). This is illustrated in figure (3).

Importantly, we only need to perform slight extensions of this algorithm to compute significantly more. First, we can perform model selection, to determine if the object is present in the image, by determining whether φ(x_H^*, τ_H) > T, where T is a threshold. This is because, by equation (5), φ(x_H^*, τ_H) is the log-likelihood ratio of the probability that the object is present at position x_H^* compared to the probability that the corresponding part of the image is generated by the background image model P_B(.). Secondly, we can compute the probability that the object occurs several times in the image by computing the set {x_H : φ(x_H, τ_H) > T}, which gives the "executive summary" descriptions for each object (e.g., the coarse positions of each object). We then perform the top-down pass initialized at each coarse position (i.e., at each point of the set described above) to determine the optimal configuration for the states of the objects. Hence, we can reuse the computations required to detect a single object in order to detect multiple instances of the object (provided there are no overlaps).^7 The number of objects in the image is determined by the log-likelihood ratio test with respect to the background model.

Figure 4: Sharing. Left Panel: Two Level-2 models A and B which share the Level-1 model b as a subpart. Center Panel: Level-1 model b. Inference only requires us to do inference over model b once, and then it can be used to compute the optimal states for models A and B. Right Panel: Note that if we combine models A and B by a root OR node then we obtain a graphical model for both objects. This model has a closed loop, which would seem to make inference more challenging. But by exploiting the shared part we can do inference optimally despite the closed loop. Inference can be done on the dictionaries, far right.

^7 Note that this is equivalent to performing optimal inference simultaneously over a set of different generative models of the image, where one model assumes that there is one instance of the object in the image, another model assumes there are two, and so on.
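To make equations (6) and (7) concrete, here is a minimal sketch of the bottom-up/top-down dynamic programming for a single two-level part on a 1-D lattice. All numbers (lattice size, offsets, the planted bar responses) are illustrative assumptions, not values from the paper, and the Gaussian term is written up to an additive constant.

```python
import numpy as np

# --- Toy setup (assumed, for illustration only): a 1-D "image" lattice ---------------
N = 16                                   # |D_0|: number of level-0 lattice positions
rng = np.random.default_rng(0)

# Level-0 local evidence: log [P(I(x)|tau) / P(I(x)|tau_0)] for two leaf types.
# Here it is random noise plus planted peaks, standing in for real filter responses.
phi0 = {"H": rng.normal(0, 0.1, N), "V": rng.normal(0, 0.1, N)}
phi0["H"][4] += 3.0                      # pretend a horizontal bar is at position 4
phi0["V"][6] += 3.0                      # and a vertical bar is at position 6

# Higher-level parts: (child types, preferred child offsets relative to the parent).
parts = {
    "T": (("H", "V"), (-1, +1)),         # a "T" is an H one step left and a V one step right
}
WINDOW = 1                               # locality: children may deviate by at most 1 from the offset
SIGMA = 1.0                              # std of the Gaussian h(.) on that deviation

def bottom_up(part):
    """Equation (6): local evidence phi(x, tau) for every parent position x."""
    child_types, offsets = parts[part]
    phi = np.full(N, -np.inf)
    back = {}                                            # pointers for the top-down pass
    for x in range(N):                                   # parent (executive summary) position
        for d1 in range(-WINDOW, WINDOW + 1):
            for d2 in range(-WINDOW, WINDOW + 1):
                x1, x2 = x + offsets[0] + d1, x + offsets[1] + d2
                if not (0 <= x1 < N and 0 <= x2 < N):
                    continue
                score = (phi0[child_types[0]][x1] + phi0[child_types[1]][x2]
                         - 0.5 * (d1 ** 2 + d2 ** 2) / SIGMA ** 2)   # log h(.; lambda), up to a constant
                if score > phi[x]:
                    phi[x], back[x] = score, (x1, x2)
    return phi, back

phi_T, back_T = bottom_up("T")
x_star = int(np.argmax(phi_T))                 # best executive-summary position
print("detected T at", x_star,
      "with child positions", back_T[x_star],  # top-down pass (equation (7)): read the stored argmax
      "and evidence", round(float(phi_T[x_star]), 2))
```

With the planted responses, the detection lands at the parent position between the two bars, and the stored back-pointers recover the child positions, mirroring the "summary first, details later" ordering described above.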
3.2 Inference on Multiple Objects by Part Sharing using the Hierarchical Dictionaries

Now suppose we want to detect instances of many object categories τ_H ∈ M_H simultaneously. We can exploit the shared parts by performing inference using the hierarchical dictionaries.

The main idea is that we need to compute the global evidence φ(x_H, τ_H) for all objects τ_H ∈ M_H and at all positions x_H in the top-level lattice. These quantities could be computed separately for each object by performing the bottom-up pass, specified by equation (6), for each object. But this is wasteful because the objects share parts, and so we would be performing the same computations multiple times. Instead we can perform all the necessary computations more efficiently by working directly with the hierarchical dictionaries.

More formally, computing the global evidence for all object models and at all positions is specified as follows:

Let D_H^* = {x_H ∈ D_H s.t. max_{τ_H ∈ M_H} φ(x_H, τ_H) > T_H},
for x_H ∈ D_H^*, let τ_H^*(x_H) = arg max_{τ_H ∈ M_H} φ(x_H, τ_H),
detect \vec{x}^*/x_H = arg max_{\vec{x}/x_H} { log [P(I | \vec{x}) / P_B(I)] + log P(\vec{x}; τ_H^*(x_H)) } for all x_H ∈ D_H^*.   (8)

All these calculations can be done efficiently using the hierarchical dictionaries (except for the max and argmax tasks at level H, which must be done separately). Recall that each dictionary element at level h is composed, by part-subpart composition, of dictionary elements at level h−1. Hence we can apply the bottom-up update rule in equation (6) directly to the dictionary elements. This is illustrated in figure (4). As analyzed in the next section, this can yield major gains in computational complexity.

Once the global evidences for each object model have been computed at each position (in the top lattice) we can perform winner-take-all to estimate the object model which has the largest evidence at each position. Then we can apply thresholding to see if it passes the log-likelihood ratio test compared to the background model. If it does pass this log-likelihood test, then we can use the top-down pass of dynamic programming, see equation (7), to estimate the most probable states of all parts of the object model.

We note that we are performing exact inference over multiple object models at the same time. This is perhaps un-intuitive to some readers because it corresponds to doing exact inference over a probability model which can be expressed as a graph with a large number of closed loops, see figure (4). But the main point is that part sharing enables us to share inference efficiently between many models.

The only computations which cannot be performed by dynamic programming are the max and argmax tasks at level H, see the top line of equation (8). These are simple operations and require order |M_H| × |D_H| calculations. This will usually be a small number, compared to the complexity of the other computations. But it will become very large if there are a large number of objects, as we will discuss in section (4).
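The computational point of this subsection can be seen in a few lines: if the bottom-up quantity φ is cached per dictionary element (rather than recomputed per object), every object that contains a shared part reuses the same table. The sketch below uses a hypothetical toy dictionary and toy scores purely to illustrate the memoization pattern.

```python
from functools import lru_cache

# Hypothetical hierarchical dictionary: each part is either a leaf type or a pair of sub-parts.
DICTIONARY = {
    "T": ("H", "V"), "L": ("H", "V"),       # level-1 parts (share the same leaf types)
    "A": ("T", "L"), "B": ("T", "T"),       # level-2 objects (both contain the part "T")
}

calls = 0

@lru_cache(maxsize=None)                     # the cache *is* the part sharing
def phi(part, position):
    """Local evidence for a dictionary element at one lattice position (toy scores)."""
    global calls
    calls += 1
    if part not in DICTIONARY:               # leaf type: pretend this is a filter response
        return 1.0
    left, right = DICTIONARY[part]
    return max(phi(left, position - d) + phi(right, position + d) for d in (0, 1))

# Evidence for both objects at the same position: the tables for "T", "H", "V" are shared.
scores = {obj: phi(obj, 10) for obj in ("A", "B")}
print(scores, "evaluations:", calls)         # far fewer evaluations than without the cache
```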
3.3 Parallel Formulation and Inference Algorithm

Finally, we observe that all the computations required for performing inference on multiple objects can be parallelized. This requires computing the quantities in equation (8).

Figure 5: Parallel Hierarchical Implementation. Far Left Panel: Four Level-0 models are spaced densely in the image (here an 8 × 2 grid). Left Panel: the four Level-1 models are sampled at a lower rate, and each has 4 × 1 copies. Right Panel: the four Level-2 models are sampled less frequently. Far Right Panel: a bird's eye view of the parallel hierarchy. The dots represent a "column" of four Level-0 models. The crosses represent columns containing four Level-1 models. The triangles represent a column of the Level-2 models.

The parallelization is possible because, in the bottom-up pass of dynamic programming, calculations are done separately for each position x, see equation (6). So we can compute the local evidence for all parts in the hierarchical dictionary recursively and in parallel for each position, and hence compute φ(x_H, τ_H) for all x_H ∈ D_H and τ_H ∈ M_H. The max and argmax operations at level H can also be done in parallel for each position x_H ∈ D_H. Similarly, we can perform the top-down pass of dynamic programming, see equation (7), in parallel to compute the best configurations of the detected objects in parallel for different possible positions of the objects (on the top-level lattice).

The parallel formulation can be visualized by making copies of the elements of the hierarchical dictionary (the parts), so that a model at level h has |D_h| copies, with one copy at each lattice point. Hence at level h, we have m_h "receptive fields" at each lattice point in D_h, with each one tuned to a different part τ_h ∈ M_h, see figure (5). At level 0, these receptive fields are tuned to specific image properties (e.g., horizontal or vertical bars). Note that the receptive fields are highly non-linear (i.e., they do not obey any superposition principle).^8 Moreover, they are influenced both by bottom-up processing (during the bottom-up pass) and by top-down processing (during the top-down pass). The bottom-up processing computes the local evidence while the top-down pass modifies it by the high-level context.

Figure 6: Parallel implementation of Dynamic Programming. The left part of the figure shows the bottom-up pass of dynamic programming. The local evidence for the parent node is obtained by taking the maximum of the scores of the C_r possible states of the child nodes. This can be computed by a two-layer network where the first level computes the scores for all C_r child node states, which can be done in parallel, and the second level computes the maximum score. This is like an AND operation followed by an OR. The top-down pass requires the parent node to select which of the C_r child configurations gave the maximum score, suppressing the other configurations.

The computations required by this parallel implementation are illustrated in figure (6). The bottom-up pass is performed by a two-layer network where the first layer performs an AND operation (to compute the local evidence for a specific configuration of the child nodes) and the second layer performs an OR, or max, operation to determine the local evidence (by max-ing over the possible child configurations).^9 The top-down pass only has to perform an argmax computation to determine which child configuration gave the best local evidence.

^8 Nevertheless they are, broadly speaking, tuned to image stimuli which have the mean shape of the corresponding part τ_h. In agreement with findings about mammalian cortex, the receptive fields become more sensitive to image structure (e.g., from bars to more complex shapes) at increasing levels. Moreover, their sensitivity to spatial position decreases because at higher levels the models only encode the executive summary descriptions, on coarser lattices, while the finer details of the object are represented more precisely at the lower levels.
^9 Note that other hierarchical models, including bio-inspired ones, use similar operations but motivated by different reasons.
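As a small illustration of the AND/OR structure in figure (6) (a sketch with assumed toy arrays, not the paper's implementation), the scores of all child configurations at every lattice position can be stacked into one array and reduced with a single max, which is the operation a "one receptive field per position" parallel implementation performs.

```python
import numpy as np

N = 16                                         # lattice positions, all processed in parallel
rng = np.random.default_rng(1)
phi_child1 = rng.normal(size=N)                # local evidence of child part 1 at each position
phi_child2 = rng.normal(size=N)                # local evidence of child part 2 at each position
offsets = [(-1, +1), (-2, +2), (0, +1)]        # C_r = 3 allowed child configurations (assumed)

# "AND" layer: one score per (configuration, position), computed with array shifts
# (wrap-around at the boundary, kept only to make the toy example short).
config_scores = np.stack([
    np.roll(phi_child1, -d1) + np.roll(phi_child2, -d2) for d1, d2 in offsets
])                                             # shape (C_r, N)

# "OR" layer: max over configurations, in parallel for every parent position.
phi_parent = config_scores.max(axis=0)
best_config = config_scores.argmax(axis=0)     # kept for the top-down (argmax) pass
print(phi_parent.shape, best_config[:5])
```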
4 Complexity Analysis

We now analyze the complexity of the inference algorithms for performing these tasks. Firstly, we analyze the complexity for a single object (without part sharing). Secondly, we study the complexity for multiple objects with shared parts. Thirdly, we consider the complexity of the parallel implementation. The complexity is expressed in terms of the following quantities: (I) The size |D_0| of the image. (II) The scale decrease factor q (enabling executive summary). (III) The number H of levels of the hierarchy. (IV) The sizes {|M_h| : h = 1, ..., H} of the hierarchical dictionaries. (V) The number r of subparts of each part. (VI) The number C_r of possible part-subpart configurations.

4.1 Complexity for Single Objects and Ignoring Part Sharing

This section estimates the complexity of inference N_so for a single object and the complexity N_mo for multiple objects when part sharing is not used. These results are for comparison to the complexities derived in the following section using part sharing.

The inference complexity for a single object requires computing: (i) the number N_bu of computations required by the bottom-up pass, (ii) the number N_ms of computations required by model selection at the top level of the hierarchy, and (iii) the number N_td of computations required by the top-down pass.

The complexity N_bu of the bottom-up pass can be computed from equation (6). This requires a total of C_r computations for each position x_ν for each level-h node. There are r^{H−h} nodes at level h and each can take |D_0| q^h positions. This gives a total of |D_0| C_r q^h r^{H−h} computations at level h. This can be summed over all levels to yield:

N_bu = Σ_{h=1}^{H} |D_0| C_r r^H (q/r)^h = |D_0| C_r r^H Σ_{h=1}^{H} (q/r)^h = |D_0| C_r (q r^{H−1} / (1 − q/r)) {1 − (q/r)^H}.   (9)

Observe that the main contributions to N_bu come from the first few levels of the hierarchy because the factors (q/r)^h decrease rapidly with h. This calculation uses Σ_{h=1}^{H} x^h = x(1 − x^H)/(1 − x). For large H we can approximate N_bu by |D_0| C_r q r^{H−1}/(1 − q/r) (because (q/r)^H will be small).

We calculate N_ms = q^H |D_0| for the complexity of model selection (which only requires thresholding at every point on the top-level lattice).

The complexity N_td of the top-down pass is computed from equation (7). At each level there are r^{H−h} nodes and we must perform C_r computations for each. This yields complexity of Σ_{h=1}^{H} C_r r^{H−h} for each possible root node. There are at most q^H |D_0| possible root nodes (depending on the results of the model selection stage). This yields an upper bound:

N_td ≤ |D_0| C_r q^H (r^{H−1} / (1 − 1/r)) {1 − 1/r^{H−1}}.   (10)

Clearly the complexity is dominated by the complexity N_bu of the bottom-up pass. For simplicity, we will bound/approximate this by:

N_so = |D_0| C_r q r^{H−1} / (1 − q/r).   (11)

Now suppose we perform inference for multiple objects simultaneously without exploiting shared parts. In this case the complexity will scale linearly with the number |M_H| of objects. This gives us complexity:

N_mo = |M_H| |D_0| C_r q r^{H−1} / (1 − q/r).   (12)
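The following is a small numeric sketch of equations (11) and (12); the parameter values (image size, q, r, C_r, number of objects) are assumptions chosen only to make the orders of magnitude visible, not values from the paper.

```python
# Assumed illustrative parameters (not values from the paper).
D0 = 256 * 256        # |D_0|: image lattice size
q = 1.0 / 4           # scale decrease factor between levels
r = 2                 # subparts per part
C_r = 50              # part-subpart configurations per composition
H = 5                 # number of levels
M_H = 1000            # number of object categories

N_so = D0 * C_r * q * r ** (H - 1) / (1 - q / r)      # equation (11): one object
N_mo = M_H * N_so                                      # equation (12): all objects, no sharing

print(f"single object, no sharing : {N_so:.3e} operations")
print(f"{M_H} objects, no sharing : {N_mo:.3e} operations")
```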
4.2 Computation with Shared Parts in Series and in Parallel

This section computes the complexity using part sharing: firstly, for the standard serial implementation of part sharing; secondly, for the parallel implementation.

Now suppose we perform inference on many objects with part sharing using a serial computer. This requires performing computations over the part-subpart compositions between elements of the dictionaries. At level h there are |M_h| dictionary elements. Each can take |D_h| = q^h |D_0| possible states. The bottom-up pass requires performing C_r computations for each of them. This gives a total of Σ_{h=1}^{H} |M_h| C_r |D_0| q^h = |D_0| C_r Σ_{h=1}^{H} |M_h| q^h computations for the bottom-up process. The complexity of model selection is q^H |D_0| × (|M_H| + 1) (this is between all the objects, and the background model, at all points on the top lattice). As in the previous section, the complexity of the top-down process is less than the complexity of the bottom-up process. Hence the complexity for multiple objects using part sharing is given by:

N_ps = |D_0| C_r Σ_{h=1}^{H} |M_h| q^h.   (13)

Next consider the parallel implementation. In this case almost all of the computations are performed in parallel, and so the complexity is now expressed in terms of the number of "neurons" required to encode the dictionaries, see figure (5). This is specified by the total number of dictionary elements multiplied by the number of spatial copies of them:

N_n = Σ_{h=1}^{H} |M_h| q^h |D_0|.   (14)

The computations, both the forward and backward passes of dynamic programming, are linear in the number H of levels. We only need to perform the computations illustrated in figure (6) between all adjacent levels. Hence the parallel implementation gives speed which is linear in H, at the cost of a possibly large number N_n of "neurons" and connections between them. (A small numeric comparison of equations (12), (13), and (14) is sketched below.)

4.3 Advantages of Part Sharing in Different Regimes

The advantages of part sharing depend on how the number of parts |M_h| scales with the level h of the hierarchy. In this section we consider three different regimes: (I) The exponential growth