Gradient Driven Learning for Pooling in Visual Pipeline Feature Extraction Models

Derek Rose and Itamar Arel
Department of Electrical Engineering and Computer Science
University of Tennessee
[email protected], [email protected]

Abstract

Hyper-parameter selection remains a daunting task when building a pattern recognition architecture which performs well, particularly in recently constructed visual pipeline models for feature extraction. We re-formulate pooling in an existing pipeline as a function of adjustable pooling map weight parameters and propose the use of supervised error signals from gradient descent to tune the established maps within the model. This technique allows us to learn what would otherwise be a design choice within the model and to specialize the maps to aggregate areas of invariance for the task presented. Preliminary results show moderate potential gains in classification accuracy and highlight areas of importance within the intermediate feature representation space.

1 Introduction

Multi-stage visual pipelines for learning feature representations of images have recently proven valuable for classifying small objects across a variety of lighting conditions, scales, and poses. An effective variant was explored in the experiments of [1], which compared features learned and encoded by stages originating in encoder-decoder networks, deep learning [2], and bag-of-features models [3]. This architecture partitions images into patches to perform local learning of an over-complete codebook and uses this codebook to form global representations of the images for classification. Intermediate or mid-level representations are of high dimensionality and retain the spatial structure of the image. Pooling is an integral late stage that performs the same role as the sub-sampling layer of a convolutional neural network (CNN): reduction in the final number of features (passed to the classifier or next layer) by an aggregation operation meant to improve invariance to small changes [4].

Unfortunately, each additional stage of the architecture often adds hyper-parameters for model selection that must be explored. For the pooling layer, the number of pools, their structure or spatial layout, the weights within each region, and the operator (often max, average, or p-norm) are hyper-parameters that are frequently chosen by rules of thumb. Recently, [5] explored pool selection by optimization over a full training set with an over-complete number of pools and achieved excellent improvements over standard pyramid models, although their method uses a feature selection stage with retraining to tractably create a classifier. In this paper we present a method for learning pooling maps with weight parameters that may optimize or tune the feature space representation for better discrimination through stochastic gradient descent. This converts two of the model choices above into parameters which may be learned from a limited stream of labeled training data.

Back-propagation through the architecture in [1] is used to obtain the appropriate weight updates. This technique stems back at least to the inception of convolutional neural networks and graph transformer networks [4], where each module or layer of the network may be utilized in a forward pass output calculation and a backward pass parameter update.

2 Architecture Description and Design

The multi-stage image recognition architecture considered from [1] consists of patch extraction, normalization and whitening, codebook learning, feature encoding, spatial pooling, and classification.
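To make the data flow concrete, the following is a minimal sketch of this forward path for a single-channel image, assuming the stride-1 dense extraction, triangle encoding, and quadrant average pooling configuration detailed below. Whitening and codebook learning are omitted (a random codebook stands in), and the helper names (extract_patches, triangle_encode, quadrant_pool) are illustrative rather than taken from the authors' implementation.

```python
# Minimal sketch of the pipeline forward path; shapes and names are illustrative.
import numpy as np

def extract_patches(img, w=6):
    """Densely extract w x w patches (stride 1) from an n x n single-channel image."""
    n = img.shape[0]
    P = n - w + 1                              # patches per side
    patches = np.empty((P * P, w * w))
    for i in range(P):
        for j in range(P):
            patches[i * P + j] = img[i:i + w, j:j + w].ravel()
    return patches                             # (P*P, w*w)

def triangle_encode(patches, codebook):
    """Triangle (soft-threshold) encoding of each patch against k codewords, as in [1]."""
    d = np.linalg.norm(patches[:, None, :] - codebook[None, :, :], axis=2)
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)    # (P*P, k)

def quadrant_pool(g, P):
    """Average-pool the P x P x k mid-level representation over the four image quadrants."""
    g = g.reshape(P, P, -1)
    half = P // 2
    quads = [g[:half, :half], g[:half, half:], g[half:, :half], g[half:, half:]]
    return np.concatenate([q.mean(axis=(0, 1)) for q in quads])  # (4k,)

# Toy run: a random 32x32 "image" and a random codebook of k = 400 codewords.
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
codebook = rng.standard_normal((400, 36))
g = triangle_encode(extract_patches(img), codebook)
h = quadrant_pool(g, P=32 - 6 + 1)
print(h.shape)                                 # (1600,) = 4k features for the classifier
```

With CIFAR-10-sized inputs (n = 32), w = 6, and k = 400, this yields the 4k = 1600-dimensional pooled feature vector passed to the classifier.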
As guided by [1] we use hyper-parameters such as the number of codewords k = 400, patch size w = 6 pixels, stride 1 for dense extraction, triangle encoding, and patch whitening and pre-processing, with a focus on the CIFAR-10 dataset. Quadrant-based average pooling was used to sub-sample the intermediate feature space representation down to 4k features. This choice stemmed from spatial pyramids and was not a heavy focus of [1], although motivating work in [5] has shown that there are both a large number of options for pooling regions and much performance to be gained.

Although building pools for invariance can be partially based on intuition when patch extraction is performed in spatial register with the original image (as in the single-layer architecture considered here), the choice of pooling regions for layers of features that build upon those generated in the lowest layer is not as straightforward (as noted by [6]). Special care must be taken to reorder features and assemble pools that maintain the structure inherent in the image. It would be beneficial if the regions could change relative to problem demands as well as layer context, and we propose the use of pooling weight maps that may be adjusted as learning proceeds with gradient descent.

3 Stochastic Gradient Descent Based Method for Weighted Pooling

To obtain continuous updates as well as a gradient signal, we replace the support vector machine originally used for classification in [1] with a feed-forward neural network with a single hidden layer and a mean-square error cost function. The outputs are one-hot binary vectors for the t = 10 target classes and the inputs are pooled features h. If we have an n × n pixel square image and let P = n − w + 1, with dense patch extraction we obtain a P × P × k mid-level representation (this may be modified for non-square images). Pooling reduces this representation to p × k features that can be flattened to form h. We then extend the network back to include an additional input layer that holds our p weight maps of size P × P, which we denote W^i. The inputs to this new layer are encoded features g computed from the P × P × k patches.

This method shares many similarities with the sub-sampling layer of a CNN. In the sub-sampling layer, each neuron receives the average of the features from the prior layer in its receptive field or pool (receptive fields of units do not overlap). Each unit has a coefficient and bias that are trained with gradient information and feed into a sigmoid activation function that controls the response of the sub-sampling unit [4]. Aside from its application in an alternate context, our approach differs in that the features which are averaged are no longer constrained to be equally weighted, and multiple pools may utilize the same encoded patches. Learned weights combine encoded features g from the mid-level representation and pass these through a linear activation with unity weight to a neural network classifier, i.e. the pooled input h_i over encoded patches g in pool i is computed¹ by

    h_i = f_{pool}(g) = \sum_{m=1}^{P} \sum_{n=1}^{P} W^i_{m,n} \, g_{m,n}.    (1)

We have not yet explored using non-linear activation units with adjustable biases and coefficients, and although we impose no restriction that pools be non-overlapping as in CNNs, we are currently examining modifications to the loss function to maximize diversity among the pools. The weight sharing scheme employed by [4] for learning feature maps is similarly used here to reduce the parameter search space by sharing weights across the codeword dimension k of the mid-level representation. Figure 1(a) illustrates the positioning of the learned parameters within the architecture. To replicate the quadrant-based pooling we employ four maps that connect to every patch and are initially set to W_{m,n} = 1/(P/2)^2 inside the hot zone of the corresponding quadrant map and zero elsewhere.

¹ The codeword index has been omitted for clarity; h and g have a k-sized dimension over which W is shared.
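As a concrete illustration of Eq. (1) and the quadrant initialization just described, the sketch below builds the four P × P weight maps and applies them to an encoded representation g. It assumes the shapes introduced above (P = n − w + 1, k codewords), shows only the forward computation, and uses illustrative names (init_quadrant_maps, weighted_pool) rather than the authors' code.

```python
# Minimal sketch of the weighted pooling of Eq. (1) with quadrant initialization.
import numpy as np

def init_quadrant_maps(P):
    """Four P x P maps, uniform (= 1/(P/2)^2) inside one quadrant each, zero elsewhere."""
    half = P // 2
    maps = np.zeros((4, P, P))
    zones = [(slice(0, half), slice(0, half)), (slice(0, half), slice(half, P)),
             (slice(half, P), slice(0, half)), (slice(half, P), slice(half, P))]
    for i, (rows, cols) in enumerate(zones):
        maps[i, rows, cols] = 1.0 / (P / 2) ** 2
    return maps

def weighted_pool(g, maps):
    """h_i = sum_{m,n} W^i_{m,n} g_{m,n}; each map is shared over the k codewords."""
    P = maps.shape[1]
    g = g.reshape(P, P, -1)                               # (P, P, k)
    return np.einsum('imn,mnk->ik', maps, g).ravel()      # p maps x k codewords, flattened

# Toy usage with the CIFAR-10-sized configuration (n = 32, w = 6, k = 400).
P, k = 32 - 6 + 1, 400
g = np.random.default_rng(1).standard_normal((P * P, k))
h = weighted_pool(g, init_quadrant_maps(P))
print(h.shape)   # (1600,); approximately quadrant average pooling at initialization
```

Because the maps are dense over the P × P grid, gradient descent is free to grow, shrink, or overlap the four regions as training proceeds, which is the flexibility exploited by the update rule derived in Section 3.1.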
After learning the codebook we train the original network with 128 hidden neurons and stochastic mini-batch gradient descent (gradients averaged over random batches of 10 images) while the pool weights are held fixed. Network inputs are normalized post-pooling to h̄ using a mean and variance that are subsequently not modified. After a learning period we fix the classifier weights and learn the pooling weights. We have found this alternating training to perform better than simultaneous learning, where the neural network can compensate for pooling weight changes.

[Figure 1: (a) Pool weight map connections and feed-forward net extension; (b) learned weight maps.]

3.1 Weight Update Rule and Preliminary Results

To update each pool i's weights under loss function J, note that we want

    \Delta W^i_{m,n} = -\eta \, \partial J / \partial W^i_{m,n} = -\eta \, \frac{\partial J}{\partial \bar{h}_i} \frac{\partial \bar{h}_i}{\partial W^i_{m,n}}.

From the original backpropagation pass we may create a new set of sensitivities \delta^0 = -\partial J / \partial \bar{h} = v_1^T \delta^1 (v_1 representing the original input-to-hidden weights) for the prepended network layer. Noting that \bar{h}_i = (h_i - \mu_i)/\sigma_i and \partial \bar{h}_i / \partial W^i_{m,n} = g_{m,n}/\sigma_i, we obtain the update rule

    \Delta W^i_{m,n} = \eta \, \delta^0_i \, \frac{g_{m,n}}{\sigma_i}.    (2)

The pool weight gradient updates for the images in each mini-batch are averaged; in each, every one of the k codeword dimensions contributes to the update as a sum. In preliminary tests we freeze network learning after presenting 250k examples drawn from 80% of the CIFAR-10 training set. The final validation accuracy before pool learning, measured on the remaining 20% and averaged over five trials, was 67.56%. After training the pool weights with an additional 15k training images and checking the validation set every 500 images, the average best post-pool-learning accuracy was 68.03%, an improvement of 0.57%. Figure 1(b) contains an example of the learned weights and highlights the infinite number of possible pools that gradient descent searches over.

This method may also be used to tune the choice of p within a weighted p-norm. Open issues remain, particularly the choice of learning rate (for which we have traded the pooling structure hyper-parameters for η = 5e−5) and the number of maps needed to cover separate areas of invariance. It would be preferable to select more maps than necessary and later trim redundant features, although care is needed to avoid overfitting the training set here (and in general for this approach). Unfortunately, each additional map adds k inputs to the classifier. This problem relates closely to the feature map count or hidden neuron count hyper-parameters in CNNs.

References

[1] A. Coates, H. Lee, and A. Y. Ng, "An analysis of single-layer networks in unsupervised feature learning," in AISTATS 14, 2011.
[2] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[3] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, 2006, pp. 2169–2178.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[5] Y. Jia, H. Chang, and T. Darrell, "Beyond spatial pyramids: Receptive field learning for pooled image features," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012.
[6] A. Coates and A. Y. Ng, "Learning feature representations with k-means," in Neural Networks: Tricks of the Trade, Reloaded. Springer LNCS, 2012.