Emergence of Selective Invariance in Hierarchical Feed Forward Networks

Dipan K. Pal, Vishnu N. Boddeti, Marios Savvides
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
{dipanp, naresh, marioss}@andrew.cmu.edu

arXiv:1701.08837v1 [cs.LG] 30 Jan 2017

Abstract

Many theories have emerged which investigate how invariance is generated in hierarchical networks through simple schemes such as max and mean pooling. The restriction to max/mean pooling in theoretical and empirical studies has diverted attention away from a more general way of generating invariance to nuisance transformations. In this exploratory study, we study the conjecture that hierarchically building selective invariance is important for pattern recognition. We define selective invariance as carefully choosing the range of the transformation to be invariant to at each layer of a hierarchical network. For the purpose of our study, we utilize a novel method called adaptive pooling where the pooling weights are not constrained and in fact can adapt their pooling regions to the data. These networks with the adapted pooling regions maintain performances on object categorization tasks comparable to max/mean pooling networks despite being more prone to overfitting. Interestingly, adaptive pooling regions can converge to mean pooling (even when initialized with random pooling regions), find more general linear pooling schemes or even decide not to pool at all. The pooling regions that emerge from the data are not random but rather contiguous, illustrating invariance to contiguous ranges of transformations. We illustrate the general notion of selective invariance through object categorization experiments on large-scale datasets such as SVHN and ILSVRC 2012.

[Figure 1: A few representative pooling weights that emerge within pooling elements when using adaptive pooling on the SVHN dataset, in the first layer of a hierarchical feed forward network. The pooling elements with these weights are selectively invariant to the transformations within this region. Darker areas have very low weight and do not feed the input in those areas forward. The (lighter) regions also define the range of the transformation to be invariant to. Interestingly, pooling elements converge to mean pooling despite being initialized to random pooling weights. A few prefer to be agnostic to all transformed inputs (second column). Redundancy is observed, with multiple pooling elements being invariant to similar ranges of transformations (third column). The existence of general linear pooling schemes calls for more general theories that do not specifically assume max pooling.]

1. Introduction

Convolutional nets (ConvNets [15]) have gained immense popularity over the past decade. Despite many studies aimed at improving their generalization abilities, much of their fundamental architecture remains the same. This illustrates that the basic hierarchical modules involving convolution, non-linearity and pooling operations are very effective in various domains and modalities, including vision. Nonetheless, there is much left to answer regarding what fundamental task each of these modules is achieving. In this paper, we focus on pooling, which has received relatively less attention compared to filter weights, overall architecture and training procedures.

Traditionally, pooling is conducted over blocks or regions of the convolution map (after the convolution step) using operators such as max or mean. Essentially, pooling is part of the architecture and defines what the "connections" between modules are, whereas learning the filter/kernel weights defines the "tuning" properties in canonical architectures. Some alternative methods of pooling have recently started being explored. However, these studies approach pooling either from an engineering standpoint (making ConvNets flexible in terms of input size [10]), or from a regularization perspective [22, 9]. Nonetheless, the efficacy of such a diverse set of approaches suggests that pooling can be effective even when implemented in various ways.
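As a point of reference for the canonical schemes discussed above, the following is a minimal sketch of block-wise max and mean pooling over a 2-D response map. The function names (`block_pool`, `mean`) and the toy 4×4 map are illustrative assumptions; they are not taken from any particular library or from the experiments in this paper.

```python
def block_pool(fmap, p, op):
    """Pool a 2-D map over non-overlapping p x p blocks using `op`.

    `op` maps a flat list of block values to one number (e.g. max or mean).
    Rows/columns that do not fill a complete block are ignored (an assumption).
    """
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - p + 1, p):
        out_row = []
        for c in range(0, cols - p + 1, p):
            block = [fmap[r + i][c + j] for i in range(p) for j in range(p)]
            out_row.append(op(block))
        pooled.append(out_row)
    return pooled


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical 4 x 4 response map, pooled with a fixed 2 x 2 field.
fmap = [
    [1.0, 3.0, 0.0, 2.0],
    [5.0, 7.0, 4.0, 6.0],
    [2.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 3.0, 0.0],
]
max_pooled = block_pool(fmap, 2, max)
mean_pooled = block_pool(fmap, 2, mean)
```

In both cases the pooling region (the 2×2 block) is fixed by the architecture rather than learned, which is the restriction that adaptive pooling removes.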
Fur- invariant(evenpartiallyi.e.invarianttoasubsetofG)tothe thersuggestingthatperhapsdeeperand morefundamental transformationgroupG allowforthesamplecomplexityto objectivesareatwork. bereduced[1,17]. Previousworksuchas[1,16]showthat Oneofthefundamentalobjectivethatpoolingtriestoop- even though complete invariance is unachievable in prac- timizefor,ishypothesizedinmanyworkstobegenerating tice,(asconsolation)partialinvarianceholds. Inthispaper, invariancetoinputnuisancetransformations[3,19,1]. In- we hypothesize and empirically support the claim that se- deed,generatingusefulrepresentationsthatareinvariantto lectivepartialinvarianceisinfactnecessaryforgoodrecog- such transformations is arguably one of the core problems nitionperformance. inmanyfields,suchasvision. Poolingisthus,usuallyseen Indeed,theneedforinvariancetospecifictransformation asatoolforintroducingsuchinvarianceanditisusedand ranges in the data has been relatively understated. More implemented as such. Many successful hierarchical archi- formally, the specific subset G of the group G to which 0 tecturessuchasConvNetsemploypoolinginadeterminis- the feature is invariant should be carefully chosen. It is tic and organized manner, optimized more for engineering morecommontoinvestigateexplicitcompleteinvarianceto benefits,suchasfastcomputationandeasyimplementation, transformationgroupssuchastherotationgroupand/orthe ratherthanaccuracy. Thisisalsosincethespecificparam- translationgroup[5,7]. Groupinvariantscatteringwasalso eters of the pooling layers (such as pooling field size) are proposed as a theoretical framework for modelling entire usually selected according to heuristics and intuition due translation group invariances in ConvNets as contractions to lack of a deeper understanding of what objective pool- actingontheentirespaceglobally[18]. ingtriestoachieveandneedforspecificpoolingschemes. 
Selective Invariance: In this paper, we argue that for Poolingistraditionallynotoptimizedforthedataandonly vision tasks (and perhaps in general), selective invariance specifichyperparametersaretunedaccordinglyduringval- (also equivariance) is important. Consider the range of idationtoimprovegeneralizationofthenetwork. a transformation, which refers to the extent by which the transformationisappliedtoasamplepoint. Byselectivein- 2. A conjecture involving Selective Invariance variance, we emphasize that it is necessary to be invariant toInputTransformations tocarefullychosenpartsoftherangeofthetransformation. Additionally, it is also equally important to be equivariant We now review and introduce some concepts useful toadifferentpartoftherangeofthesametransformation. throughouttherestofthepaper. Toillustrate,considerthesub-taskofdistinguishingbe- Group: A group is a mathematical structure encoding tween6and9inanopticalcharacterrecognitiontask.Ifthe symmetrythroughasetofelementsalongwithagroupop- transformationweconsiderisrotation,thena180◦rotation erationwhichactsonanytwoelements.Thestructureneeds turnsthe6intoa9.Hence,theclassifiernotonlyneedstobe to satisfy four axioms namely, closure, associativity, iden- invarianttorotationbetween−90◦ andsay90◦ but,inthis tity and invertibility in order for the structure to be a valid hypotheticalsituation,beequivariant totheinfinitesimally group. A group can have a finite number of elements re- smalltransformationasthedigitrotatesjustbeyondthe90◦ sultinginafinitegroup. Incaseswhereagroupisusedto markintotheotherclass. Aclassifierthatiscompletelyin- modelatransformation,anysubsetofthegroupcanbeused varianttotherotationgroup(i.e.invarianttoallanglesfrom todefineaparticularrangeofthetransformation. 0◦ to 360◦) will fail the task. 
This simple thought experi- Invariant and Equivariant features: Any function f mentillustratestheneedtobeinvarianttopartsoftherange over x ∈ Rd is an invariant feature w.r.t a group G if ofaparticulartransformationwhilebeingequivarianttothe f(x) = f(gx) ∀g ∈ G1. It is an equivariant or a covari- otherparts. ant feature w.r.t the group if f(x) ∝ h(g)f(gx) ∀g ∈ G A conjecture towards a general theory of pooling: where h is a linear function defined over G (cid:55)→ R. Ideally, One of the main contributions of this paper is to provide it is desirable for a feature to be invariant to transforma- empirical evidence and motivate a theoretical understand- tionsthatdoesnotchangetheclasslabeloftheinput(intra- ing of a more general form of linear pooling. We conjec- classtransformations),butbeequivarianttotransformations turethatmoregeneralformsoflinearpoolingexist. These whichdo(inter-classtransformations). moregenerallinearpoolingschemescanworkcomparably Invariance to transformations: A plethora of litera- well to canonical pooling schemes such as max and mean tureexiststoshowthatoneofthecoreproblemsinpattern pooling, thereby forcing one to not be able to ignore them recognitionistogenerateinvariancetotransformationsgin while developing a theoretical model of pooling. We fur- thedata,leadingtosignificantimprovementsinrecognition therconjectureasaconsequence, thatanadditionalobjec- performance[21,4,1,16,11,17]. Further,featuresthatare tive for pooling operates, which suggests that though net- 1Withaslightabuseofnotation,wedenotebygxtheactionofgroup workmodelsmustbuild(partial)invariancehierarchically, elementg∈Gonx it is sufficient to build it selectively. This relaxes the con- 2 ditions required for pooling in order for architectures to 3. Partially Invariant Featuresthrough Partial perform well, thereby allowing for more general pooling GroupIntegration schemes. 
Inotherwords,usefulinvariant(andequivariant) representationscanbeobtainedbycarefullychoosingspe- A number of theories of invariance have emerged over cificrangesofthetransformationspresentinthedatatobe theyears. Mostofthemrequiresomeassumptionregarding invariant(andequivariant)towards. Theserangesofinvari- thestructureofthetransformations. Oneofthemostcom- ancecanmanytimesbemuchlargerandmorediversethan mon assumptions is that the transformations form a group whatmean/maxpoolingschemessuggestandcanalsolead [1,14,18]. Thisseemsvalidsincetransformationsinmany toredundancyinpooling,wheremultiplespatiallylocalized fields in which the importance of invariant features seems pooling nodes pool across very similar ranges of transfor- naturalsuchasvision, doindeeddealwithcommontrans- mations(seeSection5). Phenomenonsuchastheseinvoke formationsthatformagroup,further,theyareunitary(e.g. theneedforamoregeneraltheoryofpooling. translation and rotation). We motivate the use of adaptive pooling through such a group invariant framework. How- Many previous studies have examined pooling schemes ever,sinceinpractice,allmembersofthegrouparenotob- empirically and theoretically. Mean/max pooling was ex- served,weutilizetheorieswhichhavebeenshowntowork aminedindetailintermsofdiscriminabilityin[3]. Theef- underpartialobservationofthegroup2. fectof maxpooling onhard-vector quantizedfeatures was Consider a unitary group of transformations G with shown to help performance [2]. A study more aligned to- group elements g with finite cardinality (|G|). Unitary wards highlighting the importance of pooling (even with Group:Aunitarygroupisagroupwithelementssatisfying randomconvolutionfilters)is[12].Alloftheseeffortswere theunitaryproperty,i.e. (cid:104)gx,gy(cid:105)=(cid:104)x,y(cid:105)∀x,y∀g ∈G. however,restrictedtoinvestigatingmaxand/ormeanpool- We have the action of a group element g on a sample ingschemes. 
point x as gx and following this, an orbit is generated as Motivatingadaptivepooling: Weprovideevidencefor the set {gx | g ∈ G}. This orbit is unique to every point ourconjecturethroughtheuseofadaptivepooling. Tothis sinceitthethesetofallvariationsortransformationsofthe effect, we remove constraints on the canonical method of pointasdefinedbyG. Ameasurewhichintroducesinvari- pooling. We utilize a more general linear pooling model ance and allows us to compare two orbits is the distribu- (compared to mean pooling), and use the data itself to op- tionP inducedbyG onasamplex. Itcanbeshownthat x timizepoolingschemes. Themereexistenceofthesemore x ∼ x(cid:48) ⇔ P = P i.e. iftwoimages(x,x(cid:48))areequiva- x x(cid:48) generallinearpoolingschemesinnetworkswhichperform lentundersomeg,thentheirdistributionsareidentical[1]. comparably to max/mean pooling networks (as we find in This is important since we would like to be invariant to G Section5)showthatamoregeneralandfundamentalobjec- butnonethelesshaveacommondiscriminativesignatureor tiveforgeneratinginvarianceisatplay. Thiscouldsuggest feature for {gx | g ∈ G}. One can form such a discrimi- futuretheoriesandframeworksforConvNets(andperhaps nativefeaturebymeasuringpropertiesofthedistributionor otherdeeplearningalgorithms)toallowforandmodelsuch tryingtocharacterizethedistribution. Inordertodoso,any generalpoolingschemes. Althoughmaxandmeanpooling template or filter can be utilized along with the powerful aresimpletoimplementandworkwellinpractice,restrict- propertyofunitarityofthegroupG. ingourtheoreticalunderstandingtosuchspecializedpool- A single filter provides a 1-D projection of the distri- ingschemesmightredirectattentionawayfrommorepow- bution thereby providing one measurement. We can ob- erfulandgeneraltheoriesforinvarianceandperception. tainmanysuchmeasurementsinordertobemorediscrim- Adaptive Pooling: In order to investigate what kind of inative. 
Such a collection of many such filters together poolingthedatarequires,weproposeanoveladaptivepool- uniquelycharacterizestheorbit. Moreimportantly, unitar- ing layerthat learns or optimizes poolingaccording to the ityofthegroupallowsthefollowingforafiltert loss function and the data. The adaptive pooling layer is trained using standard back propagation along with some (cid:104)gx,t(cid:105)=(cid:104)x,g−1t(cid:105) (1) regularization. Our use for this layer in this paper is very specific. Theadaptivepoolinglayersimplyservesasaway Hence, the distribution of the set {(cid:104)gx,t(cid:105)},∀g ∈ G is topracticallyprovetheexistenceofasetofnetworkparam- exactly as that of {(cid:104)x,g−1t(cid:105)},∀g ∈ G. Following this, eters (filter/kernel weights and pooling weights) that work in order to characterize the orbit of a novel sample un- wellfortheobjectcategorizationtask. Selectiveinvariance der a group, it is not necessary to explicitly observe all its properties emerge in the pooling weights (see Fig. 2 and transformations under the group. Since the orbit and its Fig.6)aswefindinourexperimentslater. Wealsomodel adaptivepoolinginagroupinvariantframeworkandshow 2Agroupissaidtobefullyobserved,ifduringtraining,samplesare availablethathavebeenacteduponbyallmembersofthegroup. Par- how it invokes selective invariance and equivariance prop- tialobservancereferstosettingwhereonlyasubsetofthosesamplesare erties(seeSection4). availableforuse. 3 n o i g e r t n a i r a v i u q E e c a p s Pooling weight (α) Inva1riant Equivarian2t Inva3riant Equ4ivariant Partial group action in input gg32g1InIvnv1variaarinat nrte grieognionv2v3 Feature space Range of finite partial transformation group (a) (b) Figure2: (a)Illustrativedepictionofpoolingweightsovertherangeofapartialtransformationgroup. Differentsectionsareeitheragnostic,invariant orequivariantdependingontheweightsoverthatparticularrange(subsetofthepartialgroup). 
(b)Diagramillustratinghowsubsetsofthegroupof transformation(ranges,suchasg1,g2,g3)maptodifferentpartsofthefeaturespacedependingonthepoolingweights.v1,v3depictinvariantregionsfor asinglepoolingweight(correspondingtoasinglepoolingelement)whichareinvariantw.r.ttorangesg1,g3andmapthemtosinglepoints(inred)fora giveninput.v2mapsg2toaline(bold)sinceitisequivarianttog2. corresponding distribution is invariant to G, many possi- v ∈ Rn. Let L be the loss that the network optimizes k ble invariant features or measures can be computed. Two for. TheadaptivepoolingmatrixisdefinedbyA ∈ Rm×n sets of examples are 1) the moments of the distribution withnpoolingelements. and 2) a possibly non-linear function of the inner-product Wethushavev = ATu . Welearnthepoolingmatrix k k (i.e. f(x) = η((cid:104)x,g−1t(cid:105)), where η is a non-linear thresh- usingback-propagation. Thegradientw.r.ttotheithrowof olding function). Further, measures of the distribution of A(i.e. A )thenbecomes. i {(cid:104)x,g−1t(cid:105)},∀g ∈ G whereG ⊆ G arealsoinvariantow- 0 0 ing to partial integrals over partial groups [1]. This sets (cid:18) (cid:19) ∂L ∂v ∂L theframeworkfortheincorporationofselectiveinvariance. = k =Uα (2) ∂A ∂A ∂v i i k CarefullychoosingG , onecancontrolwhichrangeofthe 0 transformationafeatureispartiallyinvarianttowards. Note that U = ∂vk is simply a matrix with the every ∂Ai Apreviousstudythatallowedpartialinvariancealthough column as the input u . Whereas α = ∂L is a vector of inamuchsimplisticnon-hierarchicalsettingis[20]. How- coefficients. Now recakll that the response∂vmkap u (i.e. in- k ever, theincorporationofpartial(non-selective)invariance put to the adaptive pooling layer) in the case of ConvNets was due to relaxation in an optimization framework. 
It isacollectionoftheinnerproductsoftheinputmapofthe has also been argued that local features should only have previous convolution layer with a convolutional kernel ω enough invariance (as opposed to complete invariance) as passed through anon-linearity η. This can be modelledas requiredbytheapplication[23]. Nonetheless,theobserva- a partial translation group G (composed of m translation T tion was presented as a general recommendation and was operators)thatactsuponaconvolutionalfilterωtoformthe notexplicitlystudied. set{g ω |g ∈ G }. Atranslationgroupisagroupcom- T T T posedoftranslationoperators. Theelementsofu arethen k 4.AdaptivePoolingModuleforLearningGen- intheformoftheset{η((cid:104)g ω,x(cid:105))|g ∈ G }. Here,xis T T T eralizedLinearPooling theinputtothepreviousconvolutionlayer. Othernetwork architectureswhichincorporatemoretransformations(such We now describe the adaptive pooling module which as[8,6])canalsobemodelledinthisframeworkaslongas generalizes mean pooling. In traditional ConvNets, mean thetransformationisunitary. poolingishighlystructured.Itisalinearoperationandthus The response of the adaptive pooling layer, i.e. the ith can be approximated using a matrix A. Each row of A is poolingelementatthekthlayerwithapoolingweightvec- called a pooling node/element with an associated pooling torαicomputes weight,anditperformspoolingonsomeregionofthecon- M M volutionmapthroughtheinnerproductandhenceislinear. (cid:88) (cid:88) v = αiη((cid:104)g ω,x(cid:105))= αiη((cid:104)ω,g−1x(cid:105)) ∀g ∈G Max pooling can be modelled in this framework by opti- ki j Tj j Tj Tj T j=1 j=1 mizingthesupportofthemaxoperationinstead. However, (3) wefocusongeneralizedlinearpoolinginthisstudy. 
Notation: Weleteveryinputtothekthlayerbedenoted The second equality holds from the unitary property of G by u ∈ Rm and the output of the layer be denoted by and the fact that G is a group (partial groups have corre- k 4 (a) (b) Figure3: AfewrepresentativepoolingweightslearnedusingadaptivepoolingontheSVHNdataset. (a)Poolingweightsfromlayer1. Interestingly, poolingelementsconvergetomeanpoolingdespitebeinginitializedtorandompoolingweights. Afewprefertobeagnostictoalltransformedinputs (secondcolumn).Redundancyisobserved,withmultiplepoolingelementsbeinginvarianttosimilarrangesoftransformations(thirdcolumn).(b)Pooling weightsfromlayer2. Moreinterestingselectiveinvariancetospecifictransformationsemerge. Manypoolingelementsareselectivetolargecontiguous rangesoftransformations(circledinblue)whereasafewelementsprefertobeinvarianttomultiplecontiguousranges(multiplepoolingregions, first column,circledinred). spondinginverseswhichalsoformapartialgroup). Hence, variant wherein the response of the pooling element is thepoolingvectorαeffectivelypoolsoverthetransforma- linearly proportional to the group element i.e. v ∝ ki tions of the input. This computes a measure of the orbit h(g )η((cid:104)g ω,x(cid:105))∀g ∈G−1 ⊆G−1,whereG−1 defines i i i equiT T equiT of the input under the partial group. In the case when α therangeoftransformationsinG−1thatthepoolingweight T is a vector with the same coefficients, this is exactly the v isequivariantto,andhisalinearfunctiondefinedover ki groupintegraloverthefinitepartialtranslationgroupGT−1. G (cid:55)→ R. SuchalinearfunctionhexistssincethegroupGT It has been shown that a non-linear feature of the form as is composed of transformations that are linearly related to Equation3ispartiallyinvariantgivenafinitepartialgroup each other. Theselectivity in each pooling element is also [1]. 
Inpracticehowever,αalsoincludessignificantlyvary- derived from its equivariant or covariant responses to cer- ing coefficients, thus introducing selective invariance (and tain ranges of the transformation apart from the invariant equivaraince)tocertaintransformationrangesinthepartial responsestootherranges. group. In order to reduce overfitting and improve learn- Potentially redundant pooling elements: Adaptive abilityofthepoolingweights,weconstrainthe(cid:96) normof 1 eachpoolingelementtobe13.Undersuchregularization,α pooling is a generalization of mean pooling. Unlike mean pooling however, every pooled element is not restricted to willcomposeofnon-zeroelementsaswellaselementsthat pooloverapre-definedp×pgrid(usuallywithp = 2,3). are negligibly small. This defines the support of the range Each pooling element can in fact, in theory, pool over the ofthetransformationsthatthepoolingelementisinvariant entireresponsemapofthepreviouslayer. Theoverlapbe- to. Fig.2(a)illustratestherangeoftransformationsthatthe tween different pooling elements is not pre-defined to be pooling weight is invariant and equivariant to for a typical complimentaryandcaninfactbesimilarformultiplepool- poolingresponse. Part1ofthepoolingelementinFig.2(a) ing elements, thereby introducing redundancy. This is a containsnearzeroweights,thusthepoolingelementisag- phenomenonwedoindeedobserveinourexperiments(see nostic to all input in that range of transformation. Part 3 Section 5). This is somewhat counter-intuitive to what we is comprised of a range which has approximately constant mightexpecttoprovideagoodrepresentation,whereineach weight.Thispartwillbeinvarianttothatparticularrangeof elementmightbeexpectedtocaptureadifferentfeatureor the transformation while providing an invariant descriptor aspectofthedata. Nonetheless,inpracticesuchaconfigu- oftheorbitoftheinputunderthepartialgroupcorrespond- rationemergestoperformjustaswell,andhenceitprovides ingtothatrange. 
moreinsightintopoolingandrepresentation. Part 2 and 4 in Fig. 2(a) however, are parts which are not invariant, but rather approximately equivariant or co- Needforredundancy:Redundancyinpoolingelements couldbehypothesizedtoprovideandretainsignificantac- 3Recallthatthegoalistofindpoolingarchitecturesandparametersthat tivationsatanygivenlayer,wherepoolingreducesthesize workwellinpracticewhichdeviatefromthestandardpoolingschemes. oftheactivationmapupthehierarchy. Hencepoolingele- Theregularizationisthusjustified,sincemerelyprovingtheexistenceof mentsmustcompeteand/orcooperatetoutilizethelimited suchparameterizationsisenoughtoshowcaseamoregeneralformofpool- ingandhinttowardsthemorefundamentalgoalofselectiveinvariance. space available in the activation map for activations more 5 95 %)90 acy (85 MMaaxx ppoooolliinngg ((ttreasitn)) cur80 Mean pooling (train) Ac Mean pooling (test) 75 Adaptive pooling (train) Adaptive pooling (test) 70 0 100 200 300 400 Epoch Figure4:RepresentativesamplesfromtheSVHNdataset. (a) 7 usefulforthetask. Havingredundantpoolingelementsthat Max pooling (train) 6 Max pooling (val) are located in the near vicinity of each other spatially pre- Mean pooling (train) sjeecrvt.ethelocalityandcontiguityoftheactivationoftheob- Average loss345 MAAAAddddeaaaaappppntttt iiiipvvvvoeeeeo pppplioooonoooogllll iiii(nnnnvgggga lffff)rrrroooommmm eeeeppppoooocccchhhh 22665555 ((((tvtvrraaaallii))nn)) 5. Emergence of Selective Invariance and Re- 2 dundantPooling 10 10 20 30 40 50 60 70 80 90 Epoch (b) We use standard ConveNets architectures while replac- ing the mean and max pooling layers with adaptive pool- Figure 5: (a) Train and test accuracies (%) on the SVHN test data for ConvNetarchitecturesutilizingmax,meanandadaptivepooling(initial- ing. 
We train the networks to minimize the logistic soft izedwithrandompoolingweights)(b)Averagelossontrainandvalida- max loss on large-scale classification benchmarks such as tiondataintheILSVRC2012datasetformax,meanandadaptivepooling. theStreetViewHouseNumbers(SVHN)andtheImageNet Adaptivepoolingwasinitializedwithmeanpoolingweightsandconvo- LargeScaleVisualRecognitionChallenge(ILSVRC)2012 lutionalweightsfrommodelsatthe25thand65thepochfromthemean poolingrun. datasets. Mean and max pooling field sizes were fixed at 2×2forallexperiments. 5.1.StreetViewHouseNumbers(SVHN) icantly deviate from the canonical mean pooling scheme. More interestingly, mean pooling emerges in the layer 1 The SVHN dataset has 10 classes corresponding to 10 poolingweightsdespitetherandompoolingweightsinitial- digits and a training and testing data size of about 73,000 ization. Thepoolingelementsadapttothetransformations and 26,000 samples. For this dataset we use a network presentinthedatasetbybeingselectivelyinvarianttomulti- with two convolution layers (64 filters of 5×5 each ) each plerangesoftransformations(firstcolumnofFig.3(b)cir- followed by an adaptive pooling layer. The last two lay- cled in red) or larger contiguous ranges (circled in blue). erswerefullyconnected(1600and128nodes). Non-linear Alsointerestingly,afewpoolingelementsinlayer1tuneto layerswereusedaftereveryconvolution. Thenetworkpa- being completely agnostic to all inputs (second column in rameters were randomly initialised including the adaptive Fig.3(a)). Thisseemstobeanartifactofthedatasetwhich pooling weights. All networks were trained using dropout is composed of optical characters that usually lie near the for400epochs. center as shown in Fig. 3. The backgrounds around the Results: Fig. 5 shows the progression of train and test digits near the edges are irrelevant to the task and hence accuracies for all three pooling schemes. Although adap- transformationsofthoseareasarenotuseful. 
Theadaptive tivepoolingsuffersinperformanceinitially,itrecoversover poolingelementslearnstobecompletelyagnostictotrans- epochsachievingcloseto∼91%accuracycomparedtothe formationsofallinputsinthatlocality. mean/max pooling result of ∼ 93%. The adaptive pool- ing network, in this particular experiment, was initialized Comparison of Adaptive Pooling to max/mean pool- using random weights. One might expect the network to ing performance. Adaptive pooling has many more pa- havedifficultylearninggiventhelargenumberofparame- rameters than max/mean pooling making it susceptible to ters, however, given the easier task (compared to an even over-fitting and also having the effects of local minima be larger scale classification task such as ILSVRC 2012), the more pronounced. There exist methods to help adaptive network gradients are informative and the network perfor- pooling achieve better performance through regularization manceimproves. etc. Howeverthisisnotthegoalofthisstudy. Thegoalof Fig.3(a)andFig.3(b)showssomerepresentativepool- thestudyistoletpatternsemergefromthedatawithmin- ing weights from the final model learned using adaptive imal regularization and to use very simple and canonical pooling. We find that at layer 2, pooling elements signif- optimizationtechniquessuchasgradientdescent. 6 6.Discussion EmergenceofSelectiveInvariance: Ourfirstobserva- tion through the SVHN (see Fig. 5(a)) and the ILSVRC 2012 (see Fig. 5(b)) experiments, is that adaptive pooling can perform comparably to max/mean pooling schemes. This validates the efficacy of the generalized linear pool- ing parameterization that adaptive pooling finds. In many cases, the pooling weights were found to deviate signif- icantly from mean pooling schemes. Additionally, a few Figure6: Representativepoolingweightslearntusingadaptivepooling fromlayers1(bottomrow)and2(toprow)oftheadaptivepoolingenabled poolingelementswerefoundtobecomecompletelyagnos- AlexNetonILSVRC2012.Layers1and2preservedmeanpooling. 
tictoallinputsdespitebeinginitializedtorandompooling weights(see Fig.2). Both theseobservationsillustrate the emergence of selectively invariant pooling elements in the networks. 5.2. ImageNet Large Scale Visual Recognition Challenge(ILSVRC)2012 Poolingelementsatlowerlayerstendtobeselectively invariant to smaller contiguous ranges of transforma- TheILSVRC2012challengehasabout1,000classesand over 1.2 million images for training and about 50,000 im- tions:Itisinterestingtofindthatadaptivepoolingelements agesforvalidation. WeusethestandardAlexNetarchitec- initialized with random pooling weights converge to mean ture[13]forthistask.Webenchmarkagainststandardmean poolingatlayer1forSVHN(seeFig.3(a)). Further, even and max pooling. To incorporate adaptive pooling, we re- thoughadaptivepoolingforAlexNetonILSVRC2012was placeallthreepoolinglayersinAlexNetwithadaptivepool- initializedtomeanpoolingweights,meanpoolingwaspre- ing. servedforlayers1and2(seeFig.5). Thisleadstoanob- servationthatinvarianceshouldbegeneratedlocallyforlo- Initialization: Convolutionfiltersforbaselinenetworks cal features in low level representations. This agrees with withmeanandmaxpoolingarealwaysinitializedrandomly. thehypothesisthatlowlevelobjectpartsandtheirfeatures We initialized the network with adaptive pooling in two have fewer transformations that they can undergo, which ways. First, we tried initializing adaptive pooling param- have a smaller support over the input space and hence are etersalongwithconvolutionparametersrandomly. Thisre- local. Pooling elements invariant to those transformations sulted in extremely slow learning owing to the increased are also localized. Mean pooling therefore seems to be a numberofparameters,noregularizationandaharderclas- goodapproximationforinvariantfeaturesatlowerlevelsin sificationtask. Second,wepre-trainedAlexNetwithmean hierarchicalnetworks. 
pooling for 25 and then 65 epochs (with convolutional layers initialized randomly) and then replaced the mean pooling layers with adaptive pooling for the models at epoch 25 and epoch 65. The adaptive pooling layers were then initialized with mean pooled weights and training continued.

Results: Fig. 5(b) shows the average training and validation loss on ILSVRC 2012 for max, mean and adaptive pooling. After the learning rate drop during training (after the first 10 epochs), adaptive pooling (epoch 65) almost matches mean and max pooling despite the large increase in the number of parameters. Initializing the network to mean pooling weights and the convolution layers to pre-trained convolutional filters (from epoch 25 and 65 respectively) helps overcome the adverse effects that accompany the increase. Fig. 2 and Figs. 7(a), 7(b), 7(c), 7(d) show some representative pooling weights from the adaptive pooling layers (layers 1, 2 and layer 3 respectively) of the model learned using AlexNet pre-trained up until epoch 65. The adaptive pooling layer was fine-tuned for about 50 additional epochs.

Pooling elements at higher layers tend to be selectively invariant to larger (possibly non-contiguous) ranges of transformations: As a general trend we also observe that the pooling elements at higher layers, such as layer 3 for AlexNet (see Fig. 6) and layer 2 for the SVHN network (see Fig. 3(b)), need specialized invariant features, since the pooling weights deviate significantly from mean pooling. This is despite the AlexNet pooling layers being initialized to mean pooling weights. This also agrees with the hypothesis that high-level object parts and complete objects can undergo a more complex set of transformations. Pooling elements that are selectively invariant at higher layers can sometimes be redundant and be invariant to extremely large contiguous ranges or even multiple smaller ranges. Hence, pooling at higher layers needs more careful handling. Perhaps mean pooling at higher layers is sub-optimal, and more effective pooling strategies that are selectively invariant could help improve the performance of these networks in general.

Figure 7: Representative pooling weights from layer 3 of the adaptive pooling enabled AlexNet on ILSVRC 2012. Interesting kinds of selective invariance to transformations emerge. (a) Selective invariance to multiple disjoint ranges of transformations (pooling on spatially discontinuous regions). (b) Selective invariance to a single large range (pooling over a large spatially contiguous region) with multiple elements that are located close by. They are invariant to the same range (redundant pooling over large ranges). (c) Multiple elements that are highly selectively invariant to specific ranges (redundant pooling over specific ranges). (d) Highly selective pooling elements (essentially mean pooling was preserved).

References

[1] F. Anselmi, L. Rosasco, and T. Poggio. On invariance and selectivity in representation learning. arXiv preprint arXiv:1503.05938, 2015.
[2] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2559–2566. IEEE, 2010.
[3] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 111–118, 2010.
[4] J. Bruna, A. Szlam, and Y. LeCun. Learning stable group invariant representations with convolutional networks. arXiv preprint arXiv:1301.3537, 2013.
[5] T. S. Cohen and M. Welling. Group equivariant convolutional networks. CoRR, abs/1602.07576, 2016.
[6] S. Dieleman, J. De Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.
[7] S. Dieleman, J. D. Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. CoRR, abs/1602.02660, 2016.
[8] R. Gens and P. M. Domingos. Deep symmetry networks. In Advances in neural information processing systems, pages 2537–2545, 2014.
[9] B. Graham. Fractional max-pooling. CoRR, abs/1412.6071, 2014.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729, 2014.
[11] G. E. Hinton. Learning translation invariant recognition in a massively parallel networks. In PARLE Parallel Architectures and Languages Europe, pages 1–13. Springer, 1987.
[12] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146–2153. IEEE, 2009.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[14] D. Laptev and J. M. Buhmann. Transformation-invariant convolutional jungles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3043–3051, 2015.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] J. Z. Leibo, Q. Liao, and T. Poggio. Subtasks of unconstrained face recognition. In International Joint Conference on Computer Vision, Imaging and Computer Graphics, VISIGRAPP, 2014.
[17] Q. Liao, J. Z. Leibo, and T. Poggio. Learning invariant representations and applications to face verification. In Advances in Neural Information Processing Systems, pages 3057–3065, 2013.
[18] S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
[19] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng. On random weights and unsupervised feature learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 1089–1096, 2011.
[20] C. H. Teo, A. Globerson, S. T. Roweis, and A. J. Smola. Convex learning with invariances. In Advances in neural information processing systems, pages 1489–1496, 2007.
[21] J. Wood and J. Shawe-Taylor. Representation theory and invariant neural networks. Discrete applied mathematics, 69(1):33–60, 1996.
[22] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
[23] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International journal of computer vision, 73(2):213–238, 2007.
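Supplementary sketch. The experiments initialize the adaptive pooling layers with mean pooled weights before training continues. The following minimal Python sketch illustrates why that initialization starts out identical to mean pooling, and how a later deviation in the weights restricts the range being pooled over. It is an illustration under stated assumptions, not the authors' implementation: the function names `mean_pool_init` and `adaptive_pool` are hypothetical, pooling is modeled as an unconstrained weighted sum over a k x k window, and a single weight window is shared across all positions for simplicity.

```python
def mean_pool_init(k):
    """Return k x k pooling weights that reproduce mean pooling (hypothetical helper)."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def adaptive_pool(fmap, weights, stride):
    """Linear pooling: each output is a weighted sum over a k x k window.

    Assumption: one weight window is shared across all spatial positions.
    """
    k = len(weights)
    out_h = (len(fmap) - k) // stride + 1
    out_w = (len(fmap[0]) - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for a in range(k):
                for b in range(k):
                    total += weights[a][b] * fmap[i * stride + a][j * stride + b]
            row.append(total)
        out.append(row)
    return out


# Toy 4 x 4 feature map with entries 0..15.
fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# With mean-pooled initialization, the adaptive layer equals 2 x 2 mean pooling.
pooled_mean = adaptive_pool(fmap, mean_pool_init(2), stride=2)

# Hand-set weights (illustrative, not learned) that pool only over the top row
# of each window, i.e. over a narrower range of positions.
selective_weights = [[0.5, 0.5], [0.0, 0.0]]
pooled_selective = adaptive_pool(fmap, selective_weights, stride=2)
```

In this toy setting the zero-weight entries play the role of the dark regions in the pooling-weight figures: inputs there are not fed forward, so the element pools only over the remaining range.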

