Complexity of Representation and Inference in Compositional Models with Part Sharing

Alan L. Yuille
Depts. of Statistics, Computer Science & Psychology
University of California, Los Angeles
yuille@stat.ucla.edu

Roozbeh Mottaghi
Department of Computer Science
University of California, Los Angeles
roozbehm@cs.ucla.edu

Abstract
This paper describes serial and parallel compositional models of multiple objects with part sharing. Objects are built by part-subpart compositions and expressed in terms of a hierarchical dictionary of object parts. These parts are represented on lattices of decreasing sizes which yield an executive summary description. We describe inference and learning algorithms for these models. We analyze the complexity of this model in terms of computation time (for serial computers) and numbers of nodes (e.g., "neurons") for parallel computers. In particular, we compute the complexity gains by part sharing and its dependence on how the dictionary scales with the level of the hierarchy. We explore three regimes of scaling behavior where the dictionary size (i) increases exponentially with the level, (ii) is determined by an unsupervised compositional learning algorithm applied to real data, (iii) decreases exponentially with scale. This analysis shows that in some regimes the use of shared parts enables algorithms which can perform inference in time linear in the number of levels for an exponential number of objects. In other regimes part sharing has little advantage for serial computers but can give linear processing on parallel computers.
1 Introduction

A fundamental problem of vision is how to deal with the enormous complexity of images and visual scenes¹. The total number of possible images is almost infinitely large [8]. The number of objects is also huge and has been estimated at around 30,000 [2]. How can a biological, or artificial, vision system deal with this complexity? For example, considering the enormous input space of images and output space of objects, how can humans interpret images in less than 150 msec [16]?
There are three main issues involved. Firstly, how can a visual system be designed so that it can efficiently represent large classes of objects, including their parts and subparts? Secondly, how can the visual system be designed so that it can rapidly infer which object, or objects, are present in an input image and the positions of their subparts? And, thirdly, how can this representation be learnt in an unsupervised, or weakly supervised, fashion? In short, what visual architectures enable us to address these three issues?
Many considerations suggest that visual architectures should be hierarchical. The structure of mammalian visual systems is hierarchical, with the lower levels (e.g., in areas V1 and V2) tuned to small image features while the higher levels (i.e., in area IT) are tuned to objects². Moreover, as appreciated by pioneers such as Fukushima [5], hierarchical architectures lend themselves naturally to efficient representations of objects in terms of parts and subparts which can be shared between many objects. Hierarchical architectures also lead to efficient learning algorithms, as illustrated by deep belief learning and others [7]. There are many varieties of hierarchical models which differ in details of their representations and their learning and inference algorithms [14, 7, 15, 1, 13, 18, 10, 3, 9]. But, to the best of our knowledge, there has been no detailed study of their complexity properties.

¹ Similar complexity issues will arise for other perceptual and cognitive modalities.
² But just because mammalian visual systems are hierarchical does not necessarily imply that this is the best design for computer vision systems.
This paper provides a mathematical analysis of compositional models [6], which are a subclass of the hierarchical models. The key idea of compositionality is to explicitly represent objects by recursive composition from parts and subparts. This gives rise to natural learning and inference algorithms which proceed from subparts to parts to objects (e.g., inference is efficient because a leg detector can be used for detecting the legs of cows, horses, and yaks). The explicitness of the object representations helps quantify the efficiency of part-sharing and makes mathematical analysis possible. The compositional models we study are based on the work of L. Zhu and his collaborators [19, 20] but we make several technical modifications, including a parallel re-formulation of the models. We note that in previous papers [19, 20] the representations of the compositional models were learnt in an unsupervised manner, which relates to the memorization algorithms of Valiant [17]. This paper does not address learning but instead explores the consequences of the representations which were learnt.
Our analysis assumes that objects are represented by hierarchical graphical probability models³ which are composed from more elementary models by part-subpart compositions. An object – a graphical model with H levels – is defined as a composition of r parts which are graphical models with H−1 levels. These parts are defined recursively in terms of subparts which are represented by graphical models of increasingly lower levels. It is convenient to specify these compositional models in terms of a set of dictionaries {M_h : h = 1, ..., H}, where the level-h parts in dictionary M_h are composed in terms of level-(h−1) parts in dictionary M_{h−1}. The highest level dictionary M_H represents the set of all objects. The lowest level dictionary M_1 represents the elementary features that can be measured from the input image. Part-subpart composition enables us to construct a very large number of objects by different compositions of elements from the lowest-level dictionary. It enables us to perform part-sharing during learning and inference, which can lead to enormous reductions in complexity, as our mathematical analysis will show.
There are three factors which enable computational efficiency. The first is part-sharing, as described above, which means that we only need to perform inference on the dictionary elements. The second is the executive-summary principle. This principle allows us to represent the state of a part coarsely because we are also representing the state of its subparts (e.g., an executive will only want to know that "there is a horse in the field" and will not care about the precise positions of its legs). For example, consider a letter T which is composed of a horizontal and a vertical bar. If the positions of these two bars are specified precisely, then we can specify the position of the letter T more crudely (sufficient for it to be "bound" to the two bars). This relates to Lee and Mumford's high-resolution buffer hypothesis [11] and possibly to the experimental finding that neurons higher up the visual pathway are tuned to increasingly complex image features but are decreasingly sensitive to spatial position. The third factor is parallelism, which arises because the part dictionaries can be implemented in parallel, essentially having a set of receptive fields for each dictionary element. This enables extremely rapid inference at the cost of a larger, but parallel, graphical model.
Section (2) introduces the key ideas of the compositional models. Section (3) describes the inference algorithms for serial and parallel implementations. Section (4) performs a complexity analysis and shows potential exponential gains by using compositional models.
2 The Compositional Models

Compositional models are based on the idea that objects are built by compositions of parts which, in turn, are compositions of more elementary parts. These are built by part-subpart compositions.

³ These graphical models contain closed loops but with restricted maximal clique size.
Figure 1: (a) Compositional part-subpart models for T and L are constructed from the same elementary components τ_1, τ_2, horizontal and vertical bars, using different spatial relations λ = (μ, σ), which impose locality. The state x of the parent node gives the summary position of the object, the executive summary, while the positions x_1, x_2 of the components give details about its components. (b) The hierarchical lattices. The size of the lattices decreases with scale by a factor q, which helps enforce executive summary and prevents having multiple hypotheses which overlap too much. q = 1/4 in this figure.
2.1 Compositional Part-Subparts
We formulate part-subpart compositions by a probabilistic graphical model which specifies how a part is composed of its subparts. A parent node ν of the graph represents the part by its type τ_ν and a state variable x_ν (e.g., x_ν could indicate the position of the part). The r child nodes Ch(ν) = (ν_1, ..., ν_r) represent the subparts by their types τ_{ν_1}, ..., τ_{ν_r} and state variables \vec{x}_{Ch(ν)} = (x_{ν_1}, ..., x_{ν_r})⁴. The type of the parent node is specified by τ_ν = (τ_{ν_1}, ..., τ_{ν_r}, λ_ν). Here (τ_{ν_1}, ..., τ_{ν_r}) are the types of the child nodes, and λ_ν specifies a distribution over the states of the subparts (e.g., over their relative spatial positions). Hence the type τ_ν of the parent specifies the part-subpart compositional model.
The probability distribution for the part-subpart model relates the states of the part and the subparts by:

$$P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu) = \delta\big(x_\nu - f(\vec{x}_{Ch(\nu)})\big)\, h(\vec{x}_{Ch(\nu)}; \lambda_\nu). \qquad (1)$$
Here f(·) is a deterministic function, so the state of the parent node is determined uniquely by the states of the child nodes. The function h(·) specifies a distribution on the relative states of the child nodes. The distribution P(\vec{x}_{Ch(ν)} | x_ν; τ_ν) obeys a locality principle, which means that P(\vec{x}_{Ch(ν)} | x_ν; τ_ν) = 0 unless |x_{ν_i} − x_ν| is smaller than a threshold for all i = 1, ..., r. This requirement captures the intuition that subparts of a part are typically close together.
The state variable x_ν of the parent node provides an executive summary description of the part. Hence it is restricted to take a smaller set of values than the state variables x_{ν_i} of the subparts. Intuitively, the state x_ν of the parent offers summary information (e.g., there is a cow in the right side of a field) while the child states \vec{x}_{Ch(ν)} offer more detailed information (e.g., the positions of the parts of the cow). In general, information about the object is represented in a distributed manner, with coarse information at the upper levels of the hierarchy and more precise information at lower levels.
We give examples of part-subpart compositions in figure (1(a)). The compositions represent the letters T and L, which are the types of the parent nodes. The types of the child nodes are horizontal and vertical bars, indicated by τ_1 = H, τ_2 = V. The child state variables x_1, x_2 indicate the image positions of the horizontal and vertical bars. The state variable x of the parent node gives a summary description of the position of the letters T and L. The compositional models for letters T and L differ by their λ parameter, which specifies the relative positions of the horizontal and vertical bars. In this example, we choose h(·; λ) to be a Gaussian distribution, so λ = (μ, σ) where μ is the mean relative position between the bars and σ is the covariance. We set f(x_1, x_2) = (1/2)(x_1 + x_2), so the state of the parent node specifies the average position of the child nodes (i.e., the positions of the two bars). Hence the two compositional models for the T and L have types τ_T = (H, V, λ_T) and τ_L = (H, V, λ_L).

⁴ In this paper, we assume a fixed value r for all part-subpart compositions.
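To make the part-subpart composition concrete, here is a minimal Python sketch of equation (1) for the T and L example. It assumes 2-D positions, r = 2 subparts, and an isotropic Gaussian for h(·; λ); the class name, the particular μ values, and the isotropic simplification of the covariance are our illustrative choices, not taken from the paper.

```python
import numpy as np

class PartSubpart:
    """Part-subpart composition of equation (1): the parent state is the
    deterministic function f of the child states, and h(.; lambda) is a
    Gaussian on the relative displacement of the two children."""

    def __init__(self, child_types, mu, sigma):
        self.child_types = child_types          # e.g. ('H', 'V'): horizontal and vertical bars
        self.mu = np.asarray(mu, dtype=float)   # mean relative position of child 2 w.r.t. child 1
        self.sigma = float(sigma)               # isotropic std. dev. (a simplified covariance)

    def parent_state(self, x_children):
        # f(x_1, ..., x_r): average of the child positions (the executive summary)
        return np.mean(x_children, axis=0)

    def log_h(self, x_children):
        # log h(x_children; lambda): Gaussian on the relative displacement x_2 - x_1
        d = (x_children[1] - x_children[0]) - self.mu
        return -0.5 * float(d @ d) / self.sigma ** 2 - np.log(2 * np.pi * self.sigma ** 2)

# The letters T and L differ only in their lambda = (mu, sigma) parameters.
tau_T = PartSubpart(('H', 'V'), mu=[0.0, 2.0], sigma=0.5)    # V bar hangs below the middle of the H bar
tau_L = PartSubpart(('H', 'V'), mu=[-2.0, -2.0], sigma=0.5)  # bars meet at a corner

x_children = np.array([[10.0, 5.0], [10.0, 7.0]])            # observed positions of the H and V bars
print(tau_T.parent_state(x_children))                        # executive-summary position of the letter
print(tau_T.log_h(x_children), tau_L.log_h(x_children))      # the T model scores this configuration higher
```

The delta term of equation (1) is enforced implicitly here: the parent state is computed directly as f of the child states rather than sampled.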
Figure 2: Left Panel: Two part-subpart models. Center Panel: Combining two part-subpart models by composition to make a higher-level model. Right Panel: Some examples of the shapes that can be generated by different parameter settings λ of the distribution.
2.2 Models of Object Categories
An object category can be modeled by repeated part-subpart compositions. This is illustrated in figure (2), where we combine T's and L's with other parts to form more complex objects. More generally, we can combine part-subpart compositions into bigger structures by treating the parts as subparts of higher-order parts.
More formally, an object category of type τ_H is represented by a probability distribution defined over a graph V. This graph has a hierarchical structure with levels h ∈ {0, ..., H}, where V = ∪_{h=0}^{H} V_h. Each object has a single root node at level H (i.e., V_H contains a single node). Any node ν ∈ V_h (for h > 0) has r children nodes Ch(ν) in V_{h−1} indexed by (ν_1, ..., ν_r). Hence there are r^{H−h} nodes at level h (i.e., |V_h| = r^{H−h}).

At each node ν there is a state variable x_ν which indicates spatial position and a type τ_ν. The type τ_H of the root node indicates the object category and also specifies the types of its parts.
The position variables x_ν take values in a set of lattices {D_h : h = 0, ..., H}, so that a level-h node, ν ∈ V_h, takes position x_ν ∈ D_h. The leaf nodes V_0 of the graph take values on the image lattice D_0. The lattices are evenly spaced and the number of lattice points decreases by a factor of q < 1 at each level, so |D_h| = q^h |D_0|, see figure (1(b)). This decrease in the number of lattice points imposes the executive summary principle. The lattice spacing is designed so that parts do not overlap. At higher levels of the hierarchy the parts cover larger regions of the image, and so the lattice spacing must be larger, and hence the number of lattice points smaller, to prevent overlapping⁵.
The probability model for an object category of type τ_H is specified by products of part-subpart relations:

$$P(\vec{x} \mid \tau_H) = \prod_{\nu \in V / V_0} P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu)\, U(x_H). \qquad (2)$$

Here U(x_H) is the uniform distribution.
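As a sketch of how the prior in equation (2) factorizes over the hierarchy, the following Python fragment evaluates the log-prior of a full configuration by recursion over the object graph. The Node container is our illustration and it reuses the hypothetical PartSubpart class from the previous sketch; the constant U(x_H) term is omitted.

```python
import numpy as np

class Node:
    """One node of the object graph: a part-subpart model (None for leaf nodes),
    a state (position), and its r children."""
    def __init__(self, part_model=None, children=(), state=None):
        self.part_model = part_model
        self.children = list(children)
        self.state = state

def log_prior(node):
    """log P(x | tau_H) of equation (2), up to the constant uniform root term U(x_H):
    a sum of part-subpart log-probabilities over all non-leaf nodes."""
    if not node.children:
        return 0.0
    child_states = np.array([c.state for c in node.children])
    # The delta term of equation (1): the parent state must equal f(child states),
    # otherwise the configuration has zero probability.
    if not np.allclose(node.state, node.part_model.parent_state(child_states)):
        return -np.inf
    return node.part_model.log_h(child_states) + sum(log_prior(c) for c in node.children)
```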
2.3 Multiple Object Categories, Shared Parts, and Hierarchical Dictionaries
Now suppose we have a set of object categories τ_H ∈ M_H, each of which can be expressed by an equation such as equation (2). We assume that these objects share parts. To quantify the amount of part sharing we define a hierarchical dictionary {M_h : h = 0, ..., H}, where M_h is the dictionary of parts at level h. This gives an exhaustive set of the parts of this set of objects, at all levels h = 0, ..., H. The elements of the dictionary M_h are composed from elements of the dictionary M_{h−1} by part-subpart compositions⁶.
This gives an alternative way to think of object models. The type variable τ_ν of a node at level h (i.e., in V_h) indexes an element of the dictionary M_h. Hence objects can be encoded in terms of the hierarchical dictionary. Moreover, we can create new objects by making new compositions from existing elements of the dictionaries.
⁵ Previous work [20, 19] was not formulated on lattices and used non-maximal suppression to achieve the same effect.
⁶ The unsupervised learning algorithm in [19] automatically generates this hierarchical dictionary.
2.4 The Likelihood Function and the Generative Model
To specify a generative model for each object category we proceed as follows. The prior specifies a distribution over the positions and types of the leaf nodes of the object model. Then the likelihood function is specified in terms of the type at the leaf nodes (e.g., if the leaf node is a vertical bar, then there is a high probability that the image has a vertical edge at that position).
More formally, the prior P(\vec{x} | τ_H), see equation (2), specifies a distribution over a set of points L = {x_ν : ν ∈ V_0} (the leaf nodes of the graph) and specifies their types {τ_ν : ν ∈ V_0}. These points are required to lie on the image lattice (e.g., x_ν ∈ D_0). We denote this as {(x, τ(x)) : x ∈ L}, where τ(x) is specified in the natural manner (i.e., if x = x_ν then τ(x) = τ_ν). We specify distributions P(I(x) | τ(x)) for the probability of the image I(x) at x conditioned on the type of the leaf node. We specify a default probability P(I(x) | τ_0) at positions x where there is no leaf node of the object.
This gives a likelihood function for the states \vec{x} = {x_ν : ν ∈ V} of the object model in terms of the image I = {I(x) : x ∈ D_0}:

$$P(I \mid \vec{x}) = \prod_{x \in L} P(I(x) \mid \tau(x)) \times \prod_{x \in D_0 / L} P(I(x) \mid \tau_0). \qquad (3)$$
The likelihood and the prior, equations (3, 2), give a generative model for each object category.

We can extend this in the natural manner to give generative models for two, or more, objects in the image provided they do not overlap. Intuitively, this involves multiple sampling from the prior to determine the types of the lattice pixels, followed by sampling from P(I(x) | τ) at the leaf nodes to determine the image I. Similarly, we have a default background model for the entire image if no object is present:

$$P_B(I) = \prod_{x \in D_0} P(I(x) \mid \tau_0). \qquad (4)$$
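The quantity that drives the inference below is the log-likelihood ratio between the object model and the background model of equation (4); because both models use the default distribution away from the object, the ratio only depends on the pixels under the leaf nodes. A minimal sketch, assuming the per-pixel likelihoods are supplied as functions (the names here are our own):

```python
def log_likelihood_ratio(image, leaves, log_p_type, log_p_background):
    """log [P(I | x) / P_B(I)] from equations (3) and (4).
    Since the background model covers every pixel, the ratio only involves
    the pixels under the leaf nodes of the object.

    image            : 2-D array of image measurements I(x)
    leaves           : list of ((i, j), leaf_type) pairs on the image lattice D_0
    log_p_type       : function (measurement, leaf_type) -> log P(I(x) | type)
    log_p_background : function (measurement) -> log P(I(x) | tau_0)
    """
    total = 0.0
    for (i, j), leaf_type in leaves:
        m = image[i, j]
        total += log_p_type(m, leaf_type) - log_p_background(m)
    return total
```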
3 Inference by Dynamic Programming
The inference task is to determine which objects are present in the image and to specify their positions. This involves two subtasks: (i) state estimation, to determine the optimal states of a model and hence the positions of the objects and their parts, and (ii) model selection, to determine whether objects are present or not. As we will show, both tasks can be reduced to calculating and comparing log-likelihood ratios, which can be performed efficiently using dynamic programming methods.

We will first describe the simplest case, which consists of estimating the state variables of a single object model and using model selection to determine whether the object is present in the image and, if so, how many times. Next we show that we can perform inference and model selection for multiple objects efficiently by exploiting part sharing (using hierarchical dictionaries). Finally, we show how these inference tasks can be performed even more efficiently using a parallel implementation. We stress that we are performing exact inference and no approximations are made. We are simply exploiting part-sharing so that computations required for performing inference for one object can be re-used when performing inference for other objects.
3.1 Inference Tasks: State Detection and Model Selection
We first describe a standard dynamic programming algorithm for finding the optimal state of a single object category model. Then we describe how the same computations can be used to perform model selection, and to perform detection and state estimation when the object appears multiple times in the image (without overlapping).
Consider performing inference for a single object category model defined by equations (2, 3). Calculating the MAP estimate of the state variables requires computing \vec{x}^* = argmax_{\vec{x}} {log P(I | \vec{x}) + log P(\vec{x}; τ_H)}. By subtracting the constant term log P_B(I) from the righthand side, we can re-express this as estimating:

$$\vec{x}^* = \arg\max_{\vec{x}} \Big\{ \sum_{x \in L} \log \frac{P(I(x) \mid \tau(x))}{P(I(x) \mid \tau_0)} + \sum_{\nu} \log P(\vec{x}_{Ch(\nu)} \mid x_\nu; \tau_\nu) + \log U(x_H) \Big\}. \qquad (5)$$
Figure 3: Left Panel: The feedforward pass propagates hypotheses up to the highest level, where the best state is selected. Center Panel: Feedback propagates information from the top node, disambiguating the middle-level nodes. Right Panel: Feedback from the middle-level nodes propagates back to the input layer to resolve ambiguities there. This algorithm estimates the top-level executive summary description in a rapid feed-forward pass. The top-down pass is required to allow high-level context to eliminate false hypotheses at the lower levels – "high-level tells low-level to stop gossiping".
Here L denotes the positions of the leaf nodes of the graph, which must be determined during inference.
We estimate \vec{x}^* by performing dynamic programming. This involves a bottom-up pass which recursively computes quantities φ(x_h, τ_h) = max_{\vec{x}/x_h} {log [P(I | \vec{x}) / P_B(I)] + log P(\vec{x}; τ_h)} by the formula:

$$\phi(x_\nu, \tau_\nu) = \max_{\vec{x}_{Ch(\nu)}} \Big\{ \sum_{i=1}^{r} \phi(x_{\nu_i}, \tau_{\nu_i}) + \log P(\vec{x}_{Ch(\nu)} \mid x_\nu, \tau_\nu) \Big\}. \qquad (6)$$

We refer to φ(x_h, τ_h) as the local evidence for part τ_h with state x_h (after maximizing over the states of the lower parts of the graphical model). This local evidence is computed bottom-up. We call this the local evidence because it ignores the context evidence for the part which will be provided during top-down processing (i.e., that evidence for other parts of the object, in consistent positions, will strengthen the evidence for this part).
The bottom-up pass outputs the global evidence φ(x_H, τ_H) for object category τ_H at position x_H. We can detect the most probable state of the object by computing x^*_H = argmax_{x_H} φ(x_H, τ_H). Then we can perform the top-down pass of dynamic programming to estimate the most probable states \vec{x}^* of the entire model by recursively performing:

$$\vec{x}^*_{Ch(\nu)} = \arg\max_{\vec{x}_{Ch(\nu)}} \Big\{ \sum_{i=1}^{r} \phi(x_{\nu_i}, \tau_{\nu_i}) + \log P(\vec{x}_{Ch(\nu)} \mid x^*_\nu, \tau_\nu) \Big\}. \qquad (7)$$

This outputs the most probable state of the object in the image. Note that the bottom-up process first estimates the optimal "executive summary" description of the object (x^*_H) and only later determines the optimal estimates of the lower-level states of the object in the top-down pass. Hence, the algorithm is faster at detecting that there is a cow in the right side of the field (estimated in the bottom-up pass) and is slower at determining the position of the feet of the cow (estimated in the top-down pass). This is illustrated in figure (3).
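The core dynamic programming update of equation (6), and the back-pointers it leaves behind for the top-down pass of equation (7), can be sketched as follows for a part with r = 2 subparts. The function signature, the brute-force enumeration of the C_r child configurations, and the dictionary-based tables are our illustrative choices, assuming small locality windows:

```python
import numpy as np

def bottom_up_update(phi_child1, phi_child2, log_p_children, child_configs, parent_of):
    """One application of equation (6) for a part with r = 2 subparts.

    phi_child1, phi_child2 : dicts mapping child lattice positions to local evidence
    log_p_children         : function (x1, x2, x_parent) -> log P(x1, x2 | x_parent; tau)
                             (equal to -inf outside the locality window)
    child_configs          : iterable of candidate (x1, x2) pairs (the C_r configurations)
    parent_of              : function (x1, x2) -> f(x1, x2), the parent lattice position
    Returns phi_parent (dict over the parent lattice) and, for each parent position,
    the arg-max child pair needed by the top-down pass of equation (7).
    """
    phi_parent, best_children = {}, {}
    for x1, x2 in child_configs:
        xp = parent_of(x1, x2)
        score = phi_child1[x1] + phi_child2[x2] + log_p_children(x1, x2, xp)
        if score > phi_parent.get(xp, -np.inf):
            phi_parent[xp] = score          # max over child configurations (the "OR" step)
            best_children[xp] = (x1, x2)    # back-pointer for the top-down pass
    return phi_parent, best_children
```

Running this update once per level, from the leaf evidence up to the root, yields φ(x_H, τ_H); following best_children back down from x^*_H recovers the optimal configuration of equation (7).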
Importantly, we only need to perform slight extensions of this algorithm to compute significantly more. First, we can perform model selection – to determine if the object is present in the image – by determining if φ(x^*_H, τ_H) > T, where T is a threshold. This is because, by equation (5), φ(x^*_H, τ_H) is the log-likelihood ratio of the probability that the object is present at position x^*_H compared to the probability that the corresponding part of the image is generated by the background image model P_B(·). Secondly, we can compute the probability that the object occurs several times in the image by computing the set {x_H : φ(x_H, τ_H) > T}, to compute the "executive summary" descriptions for each object (e.g., the coarse positions of each object). We then perform the top-down pass initialized at each coarse position (i.e., at each point of the set described above) to determine the optimal configuration for the states of the objects. Hence, we can reuse the computations required to detect a single object in order to detect multiple instances of the object (provided there are no overlaps)⁷. The number of objects in the image is determined by the log-likelihood ratio test with respect to the background model.
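In code, this model-selection step is just a threshold on the top-level evidence table; a minimal sketch, assuming phi_top maps each top-lattice position x_H to φ(x_H, τ_H) (the names are our own):

```python
def detect_instances(phi_top, threshold):
    """Model selection by the log-likelihood ratio test: keep the executive-summary
    positions {x_H : phi(x_H, tau_H) > T}. Each surviving position is then expanded
    into a full object configuration by the top-down pass of equation (7)."""
    return [x_H for x_H, score in phi_top.items() if score > threshold]
```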
Figure 4: Sharing. Left Panel: Two Level-2 models A and B which share Level-1 model b as a subpart. Center Panel: Level-1 model b. Inference computation only requires us to do inference over model b once, and then it can be used to compute the optimal states for models A and B. Right Panel: Note that if we combine models A and B by a root OR node then we obtain a graphical model for both objects. This model has a closed loop, which would seem to make inference more challenging. But by exploiting the shared part we can do inference optimally despite the closed loop. Inference can be done on the dictionaries, far right.
3.2 Inference on Multiple Objects by Part Sharing using the Hierarchical Dictionaries

Now suppose we want to detect instances of many object categories τ_H ∈ M_H simultaneously. We can exploit the shared parts by performing inference using the hierarchical dictionaries.
The main idea is that we need to compute the global evidence φ(x_H, τ_H) for all objects τ_H ∈ M_H and at all positions x_H in the top-level lattice. These quantities could be computed separately for each object by performing the bottom-up pass, specified by equation (6), for each object. But this is wasteful because the objects share parts and so we would be performing the same computations multiple times. Instead we can perform all the necessary computations more efficiently by working directly with the hierarchical dictionaries.
More formally, computing the global evidence for all object models and at all positions is specified as follows:

$$\begin{aligned}
&\text{Let } D^*_H = \{ x_H \in D_H \ \text{s.t.}\ \max_{\tau_H \in M_H} \phi(x_H, \tau_H) > T_H \},\\
&\text{For } x_H \in D^*_H, \text{ let } \tau^*_H(x_H) = \arg\max_{\tau_H \in M_H} \phi(x_H, \tau_H),\\
&\text{Detect } \vec{x}^*/x_H = \arg\max_{\vec{x}/x_H} \Big\{ \log \frac{P(I \mid \vec{x})}{P_B(I)} + \log P(\vec{x}; \tau^*_H(x_H)) \Big\} \ \text{for all } x_H \in D^*_H. \qquad (8)
\end{aligned}$$
All these calculations can be done efficiently using the hierarchical dictionaries (except for the max and argmax tasks at level H, which must be done separately). Recall that each dictionary element at level h is composed, by part-subpart composition, of dictionary elements at level h−1. Hence we can apply the bottom-up update rule in equation (6) directly to the dictionary elements. This is illustrated in figure (4). As analyzed in the next section, this can yield major gains in computational complexity.
Once the global evidences for each object model have been computed at each position (in the top lattice) we can perform winner-take-all to estimate the object model which has the largest evidence at each position. Then we can apply thresholding to see if it passes the log-likelihood ratio test compared to the background model. If it does pass this log-likelihood test, then we can use the top-down pass of dynamic programming, see equation (7), to estimate the most probable state of all parts of the object model.
⁷ Note that this is equivalent to performing optimal inference simultaneously over a set of different generative models of the image, where one model assumes that there is one instance of the object in the image, another model assumes there are two, and so on.
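A sketch of this shared bottom-up pass: each dictionary element is evaluated exactly once per lattice position, and its evidence table is reused by every higher-level part or object that contains it. The per-part attributes (child1, child2, and the geometry helpers from the earlier bottom_up_update sketch) are our hypothetical interface, not notation from the paper:

```python
def compute_dictionary_evidence(dictionaries, leaf_evidence, bottom_up_update):
    """Shared bottom-up pass of section 3.2: every dictionary element is evaluated
    once per lattice position and its table is reused by every higher-level part
    (and object) that contains it -- this reuse is where part sharing saves work.

    dictionaries[h]  : list of level-h parts; each part records its two level-(h-1)
                       children and the geometry helpers used by bottom_up_update
    leaf_evidence    : dict mapping each level-0 part to its evidence table over D_0
    Returns phi[(h, part)] = evidence table over lattice D_h.
    """
    phi = {(0, p): table for p, table in leaf_evidence.items()}
    for h in range(1, len(dictionaries)):
        for part in dictionaries[h]:
            phi[(h, part)], _ = bottom_up_update(
                phi[(h - 1, part.child1)],
                phi[(h - 1, part.child2)],
                part.log_p_children,
                part.child_configs,
                part.parent_of)
    return phi
```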
We note that we are performing exact inference over multiple object models at the same time. This is perhaps un-intuitive to some readers because it corresponds to doing exact inference over a probability model which can be expressed as a graph with a large number of closed loops, see figure (4). But the main point is that part-sharing enables us to share inference efficiently between many models.
The only computations which cannot be performed by dynamic programming are the max and argmax tasks at level H, see the top lines of equation (8). These are simple operations and require order |M_H| × |D_H| calculations. This will usually be a small number compared to the complexity of the other computations, but it will become very large if there are a large number of objects, as we will discuss in section (4).
3.3 Parallel Formulation and Inference Algorithm

Finally, we observe that all the computations required for performing inference on multiple objects can be parallelized. This requires computing the quantities in equation (8).
Figure 5: Parallel Hierarchical Implementation. Far Left Panel: Four Level-0 models are spaced densely in the image (here an 8 × 2 grid). Left Panel: the four Level-1 models are sampled at a lower rate, and each has 4 × 1 copies. Right Panel: the four Level-2 models are sampled less frequently. Far Right Panel: a bird's eye view of the parallel hierarchy. The dots represent a "column" of four Level-0 models. The crosses represent columns containing four Level-1 models. The triangles represent a column of the Level-2 models.
The parallelization is possible because, in the bottom-up pass of dynamic programming, the calculations are done separately for each position x, see equation (6). So we can compute the local evidence for all parts in the hierarchical dictionary recursively and in parallel for each position, and hence compute φ(x_H, τ_H) for all x_H ∈ D_H and τ_H ∈ M_H. The max and argmax operations at level H can also be done in parallel for each position x_H ∈ D_H. Similarly, we can perform the top-down pass of dynamic programming, see equation (7), in parallel to compute the best configurations of the detected objects for different possible positions of the objects (on the top-level lattice).
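Because every parent position is updated independently, the bottom-up step vectorizes directly. A small NumPy illustration, assuming the C_r candidate child configurations for all parent positions have already been enumerated into one array (the shapes and names are our own):

```python
import numpy as np

def parallel_bottom_up(child_scores, log_p_config):
    """Vectorized form of the two-layer circuit of figure (6), for one part.
    child_scores[p, c]  : summed child evidences for configuration c at parent position p
    log_p_config[p, c]  : log-probability of that child configuration
    First layer ("AND"): score every configuration at every position at once.
    Second layer ("OR"): max over the C_r configurations, independently per position."""
    config_scores = child_scores + log_p_config
    phi = config_scores.max(axis=1)        # local evidence phi at every parent position
    best = config_scores.argmax(axis=1)    # back-pointers for the top-down pass
    return phi, best
```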
The parallel formulation can be visualized by making copies of the elements of the hierarchical dictionaries (the parts), so that a model at level h has |D_h| copies, with one copy at each lattice point. Hence at level h, we have m_h = |M_h| "receptive fields" at each lattice point in D_h, with each one tuned to a different part τ_h ∈ M_h, see figure (5). At level 0, these receptive fields are tuned to specific image properties (e.g., horizontal or vertical bars). Note that the receptive fields are highly non-linear (i.e., they do not obey any superposition principle)⁸. Moreover, they are influenced both by bottom-up processing (during the bottom-up pass) and by top-down processing (during the top-down pass). The bottom-up processing computes the local evidence while the top-down pass modifies it by the high-level context.
The computations required by this parallel implementation are illustrated in figure (6). The bottom-up pass is performed by a two-layer network where the first layer performs an AND operation (to compute the local evidence for a specific configuration of the child nodes) and the second layer performs an OR, or max, operation to determine the local evidence (by max-ing over the possible child configurations)⁹. The top-down pass only has to perform an argmax computation to determine which child configuration gave the best local evidence.
⁸ Nevertheless they are, broadly speaking, tuned to image stimuli which have the mean shape of the corresponding part τ_h. In agreement with findings about mammalian cortex, the receptive fields become more sensitive to image structure (e.g., from bars, to more complex shapes) at increasing levels. Moreover, their sensitivity to spatial position decreases because at higher levels the models only encode the executive summary descriptions, on coarser lattices, while the finer details of the object are represented more precisely at the lower levels.
Figure 6: Parallel implementation of dynamic programming. The left part of the figure shows the bottom-up pass of dynamic programming. The local evidence for the parent node is obtained by taking the maximum of the scores of the C_r possible states of the child nodes. This can be computed by a two-layer network where the first level computes the scores for all C_r child node states, which can be done in parallel, and the second level computes the maximum score. This is like an AND operation followed by an OR. The top-down pass requires the parent node to select which of the C_r child configurations gave the maximum score, suppressing the other configurations.
4 Complexity Analysis
We now analyze the complexity of the inference algorithms for performing the tasks. Firstly, we analyze the complexity for a single object (without part-sharing). Secondly, we study the complexity for multiple objects with shared parts. Thirdly, we consider the complexity of the parallel implementation.
The complexity is expressed in terms of the following quantities: (I) The size |D_0| of the image. (II) The scale decrease factor q (enabling executive summary). (III) The number H of levels of the hierarchy. (IV) The sizes {|M_h| : h = 1, ..., H} of the hierarchical dictionaries. (V) The number r of subparts of each part. (VI) The number C_r of possible part-subpart configurations.
4.1 Complexity for Single Objects and Ignoring Part Sharing
This section estimates the complexity of inference N_so for a single object and the complexity N_mo for multiple objects when part sharing is not used. These results are for comparison to the complexities derived in the following section using part sharing.
The inference complexity for a single object requires computing: (i) the number N_bu of computations required by the bottom-up pass, (ii) the number N_ms of computations required by model selection at the top level of the hierarchy, and (iii) the number N_td of computations required by the top-down pass.
The complexity N_bu of the bottom-up pass can be computed from equation (6). This requires a total of C_r computations for each position x_ν of each level-h node. There are r^{H−h} nodes at level h and each can take |D_0| q^h positions. This gives a total of |D_0| C_r q^h r^{H−h} computations at level h. This can be summed over all levels to yield:

$$N_{bu} = \sum_{h=1}^{H} |D_0|\, C_r\, r^H (q/r)^h = |D_0|\, C_r\, r^H \sum_{h=1}^{H} (q/r)^h = |D_0|\, C_r\, \frac{q\, r^{H-1}}{1 - q/r}\,\big\{1 - (q/r)^H\big\}. \qquad (9)$$
Observe that the main contributions to N_bu come from the first few levels of the hierarchy because the factors (q/r)^h decrease rapidly with h. This calculation uses Σ_{h=1}^{H} x^h = x(1 − x^H)/(1 − x). For large H we can approximate N_bu by |D_0| C_r q r^{H−1}/(1 − q/r) (because (q/r)^H will be small).
⁹ Note that other hierarchical models, including bio-inspired ones, use similar operations but motivated by different reasons.
We calculate N_ms = q^H |D_0| for the complexity of model selection (which only requires thresholding at every point on the top-level lattice).
The complexity N_td of the top-down pass is computed from equation (7). At each level there are r^{H−h} nodes and we must perform C_r computations for each. This yields complexity Σ_{h=1}^{H} C_r r^{H−h} for each possible root node. There are at most q^H |D_0| possible root nodes (depending on the results of the model selection stage). This yields an upper bound:

$$N_{td} \le |D_0|\, C_r\, q^{H}\, \frac{r^{H-1}}{1 - 1/r}\,\Big\{1 - \frac{1}{r^{H-1}}\Big\}. \qquad (10)$$
Clearly the complexity is dominated by the complexity N_bu of the bottom-up pass. For simplicity, we will bound/approximate this by:

$$N_{so} = |D_0|\, C_r\, \frac{q\, r^{H-1}}{1 - q/r}. \qquad (11)$$
Now suppose we perform inference for multiple objects simultaneously without exploiting shared parts. In this case the complexity will scale linearly with the number |M_H| of objects. This gives us complexity:

$$N_{mo} = |M_H|\, |D_0|\, C_r\, \frac{q\, r^{H-1}}{1 - q/r}. \qquad (12)$$
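To make these formulas concrete, here is a small numeric illustration; all of the values below (image size, q, r, C_r, H, and |M_H|) are our own example choices, not taken from the paper:

```python
# Illustrative numbers (our choice): a 256 x 256 image lattice, q = 1/4, r = 3 subparts,
# C_r = 100 part-subpart configurations, H = 5 levels, and |M_H| = 1000 object categories.
D0, q, r, Cr, H, MH = 256 * 256, 0.25, 3, 100, 5, 1000

N_so = D0 * Cr * q * r ** (H - 1) / (1 - q / r)   # equation (11): a single object
N_mo = MH * N_so                                  # equation (12): all objects, no part sharing
print(f"N_so ~ {N_so:.2e}, N_mo ~ {N_mo:.2e}")
```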
4.2 Computation with Shared Parts in Series and in Parallel

This section computes the complexity when part sharing is used: first for the standard serial implementation of part sharing, and second for the parallel implementation.
Now suppose we perform inference on many objects with part sharing using a serial computer. This requires performing computations over the part-subpart compositions between elements of the dictionaries. At level h there are |M_h| dictionary elements. Each can take |D_h| = q^h |D_0| possible states. The bottom-up pass requires performing C_r computations for each of them. This gives a total of Σ_{h=1}^{H} |M_h| C_r |D_0| q^h = |D_0| C_r Σ_{h=1}^{H} |M_h| q^h computations for the bottom-up process. The complexity of model selection is q^H |D_0| × (|M_H| + 1) (this is between all the objects, and the background model, at all points on the top lattice). As in the previous section, the complexity of the top-down process is less than the complexity of the bottom-up process. Hence the complexity for multiple objects using part sharing is given by:

$$N_{ps} = |D_0|\, C_r \sum_{h=1}^{H} |M_h|\, q^h. \qquad (13)$$
Next consider the parallel implementation. In this case almost all of the computations are performed in parallel and so the complexity is now expressed in terms of the number of "neurons" required to encode the dictionaries, see figure (5). This is specified by the total number of dictionary elements multiplied by the number of spatial copies of them:

$$N_{n} = \sum_{h=1}^{H} |M_h|\, q^h\, |D_0|. \qquad (14)$$
The computations, both the forward and backward passes of dynamic programming, are linear in the number H of levels. We only need to perform the computations illustrated in figure (6) between all adjacent levels.
Hence the parallel implementation gives speed which is linear in H at the cost of a possibly large number N_n of "neurons" and connections between them.
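Continuing the numeric illustration above with a hypothetical dictionary-size schedule |M_h| (again our own choice), the serial shared-part cost of equation (13) and the "neuron" count of equation (14) can be compared to N_mo:

```python
# Continuing the illustrative numbers from the previous snippet.
D0, q, r, Cr, H, MH = 256 * 256, 0.25, 3, 100, 5, 1000
N_mo = MH * D0 * Cr * q * r ** (H - 1) / (1 - q / r)         # equation (12), for comparison

# Hypothetical dictionary sizes per level (our choice); |M_H| = 1000 objects at the top.
M = {1: 50, 2: 200, 3: 500, 4: 800, 5: 1000}

N_ps = D0 * Cr * sum(M[h] * q ** h for h in range(1, H + 1))  # equation (13): serial, with part sharing
N_n  = D0 * sum(M[h] * q ** h for h in range(1, H + 1))       # equation (14): "neurons" for the parallel version
print(f"N_ps ~ {N_ps:.2e}, N_n ~ {N_n:.2e}, serial speed-up over N_mo ~ {N_mo / N_ps:.0f}x")
```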
4.3 Advantages of Part Sharing in Different Regimes

The advantages of part-sharing depend on how the number of parts |M_h| scales with the level h of the hierarchy. In this section we consider three different regimes: (I) The exponential growth