Sparse Deep Stacking Network for Image Classification

Jun Li, Heyou Chang, Jian Yang
School of Computer Science and Technology
Nanjing University of Science and Technology, Nanjing, 219000, China.
[email protected]; [email protected]; [email protected]

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Sparse coding can learn representations that are robust to noise and can model higher-order structure for image classification. However, its inference algorithm is computationally expensive even when supervised signals are used to learn compact and discriminative dictionaries. Luckily, a simplified neural network module (SNNM) has been proposed that directly learns discriminative dictionaries and thereby avoids the expensive inference. But the SNNM module ignores sparse representations. Therefore, we propose a sparse SNNM module by adding a mixed-norm (l1/l2) regularization. The sparse SNNM modules are further stacked to build a sparse deep stacking network (S-DSN). In the experiments, we evaluate the S-DSN on four databases: Extended YaleB, AR, 15 scene and Caltech101. Experimental results show that our model outperforms related classification methods with only a linear classifier. It is worth noting that we reach 98.8% recognition accuracy on 15 scene.

Introduction

It is well known that sparse representations have a number of theoretical and practical advantages in computer vision and machine learning (Lee et al. 2007; Gregor and LeCun 2010; Yang et al. 2012). In particular, sparse coding techniques have led to promising results in image classification, e.g., face recognition and digit classification. Sparse coding, as a generative model, is a very important way to extract sparse representations. However, sparse coding has an expensive inference algorithm and does not use the labels of the training data. Although some researchers use supervised signals to learn compact and discriminative dictionaries (Jiang, Lin, and Davis 2013; Zhuang et al. 2013; Huang et al. 2013), the expensive inference algorithm remains a problem. Since the labels are already used to train the dictionaries, can we directly learn discriminative dictionaries that avoid the expensive inference?

Fortunately, a simplified neural network module (SNNM) (Deng and Yu 2011a) can directly train discriminative dictionaries and quickly calculate the representations. In the SNNM, the input layer is non-linearly mapped to a hidden layer by a projection matrix W and a sigmoid activation function, and linearly mapped to an output layer by a matrix U. Clearly, W has discriminative ability because it is trained by minimizing the least squares error between the output vector and the label vector. Moreover, the SNNM can quickly infer the hidden representation by calculating only a projection multiplication and a nonlinear transformation. Following a stacking scheme (Wolpert 1992), many SNNM modules can be further stacked to build a Deep Stacking Network (DSN), previously named the Deep Convex Network (Deng and Yu 2011b). Recently, the DSN has received increasing attention due to its successful application in speech classification and information retrieval (Deng, Yu, and Platt 2012; Deng, He, and Gao 2013). Additionally, the DSN is attractive in that the SNNM's batch-mode nature offers a potential solution to the otherwise insurmountable problem of scalability when dealing with the virtually unlimited amount of training data available nowadays (Deng and Yu 2013). Therefore, we extend the DSN to image classification.

Despite the DSN's success in speech classification, its framework also has several limitations. First, the conventional DSN only uses the sigmoid activation function for the nonlinear hidden layer (Deng, Yu, and Platt 2012). Although the sigmoid has been widely used in the literature, it suffers from a number of drawbacks: for example, training can be slow, and with random initialization the solution can get stuck at a poor local optimum with weak predictive performance (Glorot and Bengio 2010). In fact, there are two other common types of activation functions. One is the hyperbolic tangent, which has been applied to training deep neural networks but suffers from the same problems as the sigmoid. A more recent proposal is the rectified linear unit (ReLU) (Nair and Hinton 2010), which has been observed to be very useful for object recognition and often trains significantly faster (Glorot, Bordes, and Bengio 2011).
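For reference, the three activation functions discussed above can be written in a few lines. This is a minimal NumPy sketch with function names of our own choosing; the paper itself does not provide code.

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a)): the activation used by the conventional DSN
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # hyperbolic tangent: a saturating alternative with similar drawbacks
    return np.tanh(a)

def relu(a):
    # rectified linear unit max(0, a) (Nair and Hinton 2010)
    return np.maximum(0.0, a)
```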
Second, sparse representations play a key role in image classification because they can learn features that are robust to noise, produce Gabor-like filters, and model higher-order structure (Ranzato et al. 2007; Lee, Ekanadham, and Ng 2008). Evidently, sparse representations have led to promising results in image classification (Jiang, Lin, and Davis 2013). Furthermore, there is considerable evidence that in the brain the percentage of neurons active at any time is between 1 and 4% (Lennie 2003). It is therefore reasonable to consider sparse representations in SNNM modules. However, the conventional techniques for training the SNNM completely ignore sparse representations. Generally, sparsity can be achieved by penalizing non-zero activations of the hidden units (Ranzato, Boureau, and LeCun 2008) or a deviation from the expected activation of the hidden units (Lee, Ekanadham, and Ng 2008). Moreover, in neural networks the local dependencies between hidden units allow the hidden units to better model the observed data (Luo et al. 2011). But the SNNM module, whose connections within the hidden layer are restricted, cannot exhibit these dependencies. Fortunately, without increasing the number of connections, the hidden units can be divided into non-overlapping groups to capture the local dependencies among hidden units (Luo et al. 2011). The local dependencies can be implemented by imposing l1/l2 regularization upon the activation probabilities of the hidden units in the SNNM module.

In light of the above arguments, this paper exploits a Sparse Deep Stacking Network (S-DSN) for image classification. The S-DSN is obtained by stacking sparse SNNM modules, which consider two activation functions, ReLU and sigmoid, and use group sparse penalties (l1/l2 regularization) to penalize the hidden representations in each SNNM module. Our S-DSN has several explicit advantages. First, compared with the sparse coding technique LC-KSVD (Jiang, Lin, and Davis 2013), a one-layer S-DSN learns projection dictionaries, which leads to faster inference. Second, compared with the DSN, the S-DSN extracts sparse representations for learning good features for image classification. Last, the S-DSN retains the scalable structure of the DSN. To confirm the advantages of the S-DSN for image classification, extensive experiments have been performed on four databases: Extended YaleB, AR, 15 scene and Caltech101. Compared with multiple related methods, the experiments show that our model obtains better classification results than the other benchmark methods. In particular, we reach 98.8% recognition accuracy on 15 scene.

Deep Stacking Network

The DSN architecture was originally presented in (Deng and Yu 2011b). Deng and Yu explore a strategy for building deep networks based on stacking layers of the basic SNNM module, which takes the simplified form of a multilayer perceptron. We describe it mathematically as follows. Let the target vectors t_i = [t_{1i}, ..., t_{ji}, ..., t_{Ci}]^T be arranged to form the columns of T = [t_1, ..., t_i, ..., t_N] ∈ R^{C×N}. Let the input data vectors x_i = [x_{1i}, ..., x_{ji}, ..., x_{Di}]^T be arranged to form the columns of X = [x_1, ..., x_i, ..., x_N] ∈ R^{D×N}. Formally, in the basic module, the lower-layer weight matrix, denoted by W ∈ R^{D×L}, connects the linear input layer and the nonlinear hidden layer. The upper-layer weight matrix, denoted by U ∈ R^{L×C}, connects the nonlinear hidden layer with the linear output layer. The output of the upper layer is Y = U^T H, where H = σ(W^T X) ∈ R^{L×N} is the hidden layer output and σ(a) = 1/(1 + e^{−a}) is the sigmoid activation function (Deng and Yu 2011a; Deng and Yu 2011b). The parameters U and W are learned to minimize the least squares objective:

    min_{U,W} f_{dsn} = ||U^T H − T||_F^2 + α ||U||_F^2    (1)

where α is a regularization parameter on the upper-layer weight matrix U. Clearly, U has a closed-form solution:

    U = (HH^T + αI)^{−1} H T^T    (2)

By using a gradient descent algorithm (Deng and Yu 2011b) to minimize the least squares objective in (1) and deriving the gradient of W in the basic module, we obtain

    ∂f_{dsn}/∂W = 2X[H^T ∘ (I − H^T) ∘ (UU^T H − UT)^T]    (3)

where ∘ denotes element-wise multiplication and I is the matrix of all ones.
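To make Eqs. (1)-(3) concrete, the sketch below trains a single (non-sparse) DSN module with NumPy: U comes from the closed form (2) and W is updated with the gradient (3). Function and variable names, the fixed learning rate and the initialization scale are our own choices, not the authors' released code.

```python
import numpy as np

def train_dsn_module(X, T, L, alpha=0.1, lr=0.1, epochs=5, seed=0):
    """X: D x N input matrix, T: C x N label matrix, L: number of hidden units."""
    D, N = X.shape
    W = 0.01 * np.random.RandomState(seed).randn(D, L)       # lower-layer weights
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(W.T @ X)))                  # L x N hidden outputs
        # Closed-form upper-layer weights, Eq. (2): U = (H H^T + alpha I)^{-1} H T^T
        U = np.linalg.solve(H @ H.T + alpha * np.eye(L), H @ T.T)
        # Gradient of the least squares objective w.r.t. W, Eq. (3)
        grad_W = 2.0 * X @ (H.T * (1.0 - H.T) * (U @ U.T @ H - U @ T).T)
        W -= lr * grad_W
    H = 1.0 / (1.0 + np.exp(-(W.T @ X)))
    U = np.linalg.solve(H @ H.T + alpha * np.eye(L), H @ T.T)
    return W, U
```

Only W needs iterative updates; U is recomputed in closed form after every step, which is what keeps each module cheap to train.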
The "convex" solution accentuates the role of convex optimization in learning the output network weights U in each basic module (Deng and Yu 2011a). Many basic modules are stacked one on top of another to form a deep model. More specifically, the input units of a higher module can include the output units of the lower module and, optionally, the raw input features (Deng and Yu 2011b). To capture higher-order information in the data, the DSN has recently been extended to the Tensor-DSN (T-DSN) (Hutchinson, Deng, and Yu 2013), whose basic module replaces the linear map from hidden representation to output with a bilinear mapping. The T-DSN retains the scalable structure of the DSN and provides the higher-order feature interactions missing in the DSN.

Sparse Deep Stacking Network

The S-DSN is a sparse case of the DSN. The stacking operation of the S-DSN is the same as that of the DSN described in (Deng and Yu 2011b): the general paradigm is to use the output vector of the lower module and the original input vector to form the expanded "input" vector for the higher module. The modular architecture of the S-DSN, however, differs from that of the DSN: we consider both the sigmoid function and the ReLU function, and sparse penalties are added to the hidden units of the modular architecture.

Sparse Module

The output of the upper layer is Y = U^T Ĥ and the hidden layer output is

    Ĥ = φ(W^T X) ∈ R^{L×N}    (4)

where φ(a) is either the sigmoid activation function σ(a) or the ReLU activation function max(0, a). For simplicity, let H = {1, 2, ..., L} denote the set of all hidden units. H is divided into G groups, where G is the number of groups. The g-th group is denoted by G_g, where H = ∪_{g=1}^G G_g and ∩_{g=1}^G G_g = ∅. So Ĥ = [Ĥ_{G_1,:}; ...; Ĥ_{G_g,:}; ...; Ĥ_{G_G,:}].

The parameters U and W are learned to minimize the least squares objective:

    min_{U,W} f_{sdsn} = ||U^T Ĥ − T||_F^2 + α ||U||_F^2 + β Ψ(Ĥ)    (5)

where α is a regularization parameter on the upper-layer weight matrix U, β is a regularization constant on the activations of the hidden units, and Ψ(Ĥ) is the penalty imposed on the sparse representations Ĥ. Typically, the l1 norm is used as a penalty to explicitly enforce sparsity on each representation:

    Ψ(Ĥ) = Σ_{i=1}^N ||ĥ_i||_1    (6)

where ĥ_i is the representation of the i-th example (i = 1, ..., N).

In neural networks, sparse representations are advantageous for classification (Ranzato et al. 2007). Moreover, group sparse representations (Bengio et al. 2009) can learn the statistical dependencies between hidden units and lead to better performance (Luo et al. 2011). To implement these dependencies, we evenly divide the hidden units into non-overlapping groups, which restrains the dependencies within the groups and forces hidden units in a group to compete with each other (Luo et al. 2011). A mixed-norm regularization (l1/l2 norm) can then be used in the modular architecture to achieve group sparse representations. Following (Luo et al. 2011), we consider the mixed-norm regularization

    Ψ(Ĥ) = Σ_{g=1}^G ||Ĥ_{G_g,:}||_{1,2}    (7)

where Ĥ_{G_g,:} is the sub-matrix of Ĥ formed by the hidden units belonging to the g-th group and the l1/l2 norm is defined as

    ||Ĥ_{G_g,:}||_{1,2} = Σ_{i=1}^N √( Σ_{j∈G_g} ĥ_{j,i}² )    (8)
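The mixed-norm penalty in Eqs. (7)-(8) sums, over groups and training examples, the l2 norm of each group's activations. Below is a small sketch under the assumption (used throughout the paper's experiments) that the L hidden units are split into G equal-sized contiguous groups; the helper name is ours.

```python
import numpy as np

def group_sparse_penalty(H_hat, G):
    """l1/l2 penalty of Eqs. (7)-(8). H_hat: L x N hidden representations, G divides L."""
    L, N = H_hat.shape
    groups = np.split(H_hat, G, axis=0)            # G blocks, each (L/G) x N
    # per group g and example i: sqrt(sum_j h_{j,i}^2), then summed over i and g
    return float(sum(np.sqrt((Hg ** 2).sum(axis=0)).sum() for Hg in groups))
```

Setting G = L (one unit per group) makes each square root an absolute value, so the penalty reduces to the plain l1 penalty of Eq. (6).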
Learning Weights Algorithm

Once the lower-layer weight matrix W is fixed, Ĥ is determined uniquely. Solving for the upper-layer weight matrix U can then be formulated as a convex optimization problem:

    min_U f_{sdsn}^u = ||U^T Ĥ − T||_F^2 + α ||U||_F^2    (9)

which has the closed-form solution

    U = (ĤĤ^T + αI)^{−1} Ĥ T^T    (10)

There are two algorithms for learning the lower-layer weight matrix W. First, given the fixed current U, W can be optimized with a gradient descent algorithm (Deng and Yu 2011a) to minimize the squared error objective

    min_W f_{sdsn}^1 = ||U^T Ĥ − T||_F^2 + β Ψ(Ĥ)    (11)

Deriving the gradient, we obtain

    ∂f_{sdsn}^1/∂W = 2X[dφ(Ĥ^T) ∘ (UU^T Ĥ − UT)^T] + 2βX[dφ(Ĥ^T) ∘ Ĥ^T ∘/ H̃^T]    (12)

where ∘ denotes element-wise multiplication, ∘/ denotes element-wise division, H̃ is the matrix whose element is h̃_{j,i} = √( Σ_{j′∈G_g} ĥ_{j′,i}² ) for j ∈ G_g, dφ(Ĥ^T) is the element-wise gradient computation, and dφ(a) is the gradient of the activation function. When φ(a) is the sigmoid activation function, dφ(a) = σ(a)(1 − σ(a)); when φ(a) is the ReLU activation function, dφ(a) is given by

    dφ(a) = { 1, a > 0;  0, a ≤ 0 }    (13)

For the ReLU activation function, we follow the hypothesis (Glorot, Bordes, and Bengio 2011) that the hard non-linearity does not hurt optimization as long as the gradient can be propagated to many hidden units.

Second, to move W faster towards a direction that finds the optimal points, the deterministic nonlinear relationship between U and W is used to compute the gradient. Plugging (10) into criterion (5), the least squares objective is rewritten as

    min_W f_{sdsn}^2 = ||[(ĤĤ^T + αI)^{−1} Ĥ T^T]^T Ĥ − T||_F^2 + α ||(ĤĤ^T + αI)^{−1} Ĥ T^T||_F^2 + β Ψ(Ĥ)    (14)

However, when regularization is used in the objective function (5) (i.e., α > 0), the gradient of f_{sdsn}^2 can be very complicated. To simplify the gradient we assume α = 0 in f_{sdsn}^2, so the second term of f_{sdsn}^2 vanishes. Similar to (Deng and Yu 2011b), we then derive the gradient ∂f_{sdsn}^2/∂W and obtain

    ∂f_{sdsn}^2/∂W = 2X[dφ(Ĥ^T) ∘ [Ĥ^†(Ĥ T^T)(T Ĥ^†) − T^T(T Ĥ^†)]] + 2βX[dφ(Ĥ^T) ∘ Ĥ^T ∘/ H̃^T]    (15)

where Ĥ^† = Ĥ^T(ĤĤ^T)^{−1}, and dφ(·) and H̃ are defined as in (12).

The algorithm then updates W using the gradient defined in (12) or (15):

    W = W − ε ∂f_{sdsn}^1/∂W   or   W = W − ε ∂f_{sdsn}^2/∂W    (16)

where ε is a learning rate. The weight matrix learning process is outlined in Algorithm 1.

Algorithm 1: Training Algorithm of the Sparse Module
1: Input: Data matrix X, label matrix T, parameters θ = {ε, α, β, G} and training epochs E.
2: Initialize: Projection matrix W is initialized with small random values.
3: Given W, calculate Ĥ by Eq. (4).
4: Update W by Eq. (16).
5: Repeat 3-4 for E epochs (or until convergence).
6: Output: weight matrix W.
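The following sketch illustrates Algorithm 1 with the first learning rule, i.e., the closed-form U of Eq. (10) followed by a gradient step on W using Eqs. (12), (13) and (16), for the ReLU variant. Names, the small constant guarding the division, and the fixed learning rate are our own choices and only indicate one possible implementation.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def drelu(a):
    # Eq. (13): gradient of the ReLU activation
    return (a > 0).astype(a.dtype)

def train_sparse_module(X, T, L, G, alpha=0.5, beta=0.001, lr=0.1, epochs=5, seed=0, eps=1e-8):
    """Algorithm 1 with ReLU and the gradient of Eq. (12).
    X: D x N data, T: C x N labels, L: number of hidden units (assumed divisible by G)."""
    D, N = X.shape
    W = 0.01 * np.random.RandomState(seed).randn(D, L)
    for _ in range(epochs):
        A = W.T @ X                                   # pre-activations, L x N
        H = relu(A)                                   # hidden representations, Eq. (4)
        # Closed-form upper-layer weights for the current W, Eq. (10)
        U = np.linalg.solve(H @ H.T + alpha * np.eye(L), H @ T.T)
        # H_tilde: each unit's entry holds its group's l2 norm per example, as in Eq. (12)
        group_norms = np.sqrt((H.reshape(G, L // G, N) ** 2).sum(axis=1, keepdims=True))
        H_tilde = np.repeat(group_norms, L // G, axis=1).reshape(L, N)
        # Gradient of Eq. (12): reconstruction term + group-sparsity term, then Eq. (16)
        grad = 2 * X @ (drelu(A.T) * (U @ U.T @ H - U @ T).T) \
             + 2 * beta * X @ (drelu(A.T) * (H.T / (H_tilde.T + eps)))
        W -= lr * grad
    return W
```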
The S-DSN Architecture

The sparse SNNM module described in the above subsection is used to construct the K-layer S-DSN architecture, where K is the number of layers. In the k-th sparse module we denote the input by X^k, the hidden representations by Ĥ^k, the output by Y^k, the label matrix by T, and the weight matrices by W^k and U^k. Given input data X and labels T, we set X^1 = X for k = 1. The general paradigm of the S-DSN can then be decomposed into three phases:
• Step 1: Train the k-th sparse module to minimize the least squares error between Y^k and T.
• Step 2: Generate the input X^{k+1} of the (k+1)-th sparse module by appending the output Y^k of the k-th sparse module to the original input.
• Step 3: Iterate Step 1 and Step 2 to construct the S-DSN architecture.

We summarize the optimization of the S-DSN in Algorithm 2. To capture sparse representations from raw data, this paper proposes the S-DSN, which is implemented by penalizing the hidden unit activations and rectifying the negative outputs of the hidden unit activations. Due to the simple structure of each module, the S-DSN retains the computational advantage of the DSN in parallelism and scalability while learning all parameters.

Algorithm 2: Training Algorithm of the S-DSN
1: Input: Data X, label T, parameters θ = {ε, α, β, G}, training epochs E and the number of layers K.
2: Initialize: X^1 = X and k = 1.
3: while k ≤ K do
4:   Given X^k, T, θ and E, optimize W^k by Algorithm 1.
5:   Given X^k and W^k, calculate Ĥ^k by Eq. (4), U^k by Eq. (10) and Y^k = (U^k)^T Ĥ^k.
6:   X^{k+1} = [X; Y^k].
7:   k = k + 1.
8: end while
9: Output: weight matrices W^k (k = 1, ..., K).
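A compact sketch of Algorithm 2 is given below: each layer trains one sparse module on the current input and the module's prediction Y^k is stacked under the raw input to form X^{k+1}. It reuses the hypothetical train_sparse_module and relu helpers from the Algorithm 1 sketch above.

```python
import numpy as np

# Assumes the train_sparse_module and relu helpers sketched after Algorithm 1.

def train_sdsn(X, T, L, G, K, alpha=0.5, **module_kwargs):
    """Algorithm 2: stack K sparse modules, feeding [X; Y^k] to the next module."""
    layers, Xk = [], X
    for _ in range(K):
        W = train_sparse_module(Xk, T, L, G, alpha=alpha, **module_kwargs)  # step 4
        H = relu(W.T @ Xk)                                                  # Eq. (4)
        U = np.linalg.solve(H @ H.T + alpha * np.eye(L), H @ T.T)           # Eq. (10)
        Y = U.T @ H                                                         # step 5
        layers.append((W, U))
        Xk = np.vstack([X, Y])                                              # step 6
    return layers
```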
Experiments

We present experimental results on four databases: the Extended YaleB database, the AR face database, Caltech101 and 15 scene categories.
• Extended YaleB database: this database contains 2,414 frontal face images of 38 people, with about 64 images per person. The original images were cropped and normalized to 192×168 pixels.
• AR database: this database consists of over 4,000 color images of 126 people. Each person has 26 face images taken during two sessions. These images include more facial variations, including different illumination conditions, different expressions, and different facial "disguises" (sunglasses and scarves). Following the standard evaluation procedure, we use a subset of the database consisting of 2,600 images from 50 male subjects and 50 female subjects. Each face image was cropped and normalized to 165×120 pixels.
• Caltech101: this database contains 9,144 images belonging to 101 classes, with about 40 to 800 images per class. Most images in Caltech101 are of medium resolution, i.e., about 300×300.
• 15-Scene: this data set, compiled by several researchers, contains a total of 4,485 images falling into 15 categories, with the number of images per category ranging from 200 to 400. The categories include living room, bedroom, kitchen, highway, mountain, et al.

According to (Jiang, Lin, and Davis 2013), the four databases are preprocessed as follows¹: in the Extended YaleB database and the AR face database, each image is projected onto an n-dimensional feature vector with a randomly generated matrix drawn from a zero-mean normal distribution. The dimension of a random-face feature is 504 for Extended YaleB and 540 for AR. In the face databases the n-dimensional features of each image are normalized to [−1, 1]. For the Caltech101 database, we first extract SIFT descriptors from 16×16 patches, densely sampled from each image on a grid with a 6-pixel step size; we then extract the spatial pyramid feature based on the extracted SIFT features with three grids of size 1×1, 2×2 and 4×4. To train the codebook for the spatial pyramid, we use standard k-means clustering with k = 1024. For the 15 scene category database, we compute the spatial pyramid feature using a four-level spatial pyramid and a SIFT-descriptor codebook with a size of 200. Finally, the spatial pyramid features are reduced to 3,000 dimensions by PCA.

¹ They can be downloaded from http://www.umiacs.umd.edu/∼zhuolin/projectlcksvd.html

The matrix parameters are initialized with small random values sampled from a normal distribution with zero mean and standard deviation 0.01. For simplicity, we use a constant learning rate ε chosen from {20, 15, 5, 2, 1, 0.2, 0.1, 0.05, 0.01, 0.001}, a regularization parameter α chosen from {1, 0.5, 0.1}, a sparse regularization parameter β chosen from {0.1, 0.05, 0.01, 0.001, 0.0001} and a group number G chosen from {2, 4, 5, 10, 20}. In all experiments, we train for only 5 epochs, the number of hidden units is 500 and the number of layers is 2. For each data set, each experiment is repeated 10 times with randomly selected training and testing images, and the average precision is reported. In the rest of this paper, S-DSN(sigm) denotes the S-DSN with the sigmoid function and S-DSN(relu) denotes the S-DSN with the ReLU function; DSN-1, S-DSN(sigm)-1 and S-DSN(relu)-1 denote the one-layer DSN, S-DSN(sigm) and S-DSN(relu), respectively; DSN-2, S-DSN(sigm)-2 and S-DSN(relu)-2 denote the two-layer DSN, S-DSN(sigm) and S-DSN(relu), respectively.

Sparseness Comparisons

Before presenting classification results, we first show the sparseness of S-DSN(sigm) and S-DSN(relu) compared to the DSN. We use Hoyer's sparseness measure (HSM) (Hoyer 2004) to quantify how sparse the representations learned by S-DSN(sigm), S-DSN(relu) and DSN are. This measure has good properties: it lies in the interval [0, 1] on a normalized scale, and a value closer to 1 means that there are more zero components in the vector.
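For a length-n vector h, Hoyer's measure is (√n − ||h||_1/||h||_2) / (√n − 1), which is 0 for a uniform vector and 1 for a vector with a single non-zero entry. A small sketch of the per-vector computation follows; the helper name is ours.

```python
import numpy as np

def hoyer_sparseness(h, eps=1e-12):
    """Hoyer (2004): (sqrt(n) - ||h||_1 / ||h||_2) / (sqrt(n) - 1); 0 = uniform, 1 = one-hot."""
    h = np.asarray(h, dtype=float).ravel()
    n = h.size
    l1 = np.abs(h).sum()
    l2 = np.sqrt((h ** 2).sum()) + eps   # eps guards the all-zero case
    return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))
```

A network-level HSM could then be obtained by averaging this quantity over the hidden representations of a set of test images; the paper does not spell out its exact averaging protocol, so that detail is an assumption here.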
We perform comparisons on the Extended YaleB and AR databases, and the results are reported in Table 1. The sparseness results show that S-DSN(sigm) and S-DSN(relu) achieve both higher sparseness and higher recognition accuracy. Table 1 compares the network HSM of S-DSN(sigm) and S-DSN(relu) to that of the DSN. We observe that the average sparseness of the two-layer S-DSN(sigm) is about 0.105 (Extended YaleB) and the average sparseness of the two-layer S-DSN(relu) is about 0.291 (AR). In contrast, the average sparseness of the two-layer DSN is below 0.02 on these databases. It can be seen that the S-DSN learns sparser representations. Due to space limitations, Figure 1 only visualizes the activation probabilities of the first hidden layer, computed under S-DSN(relu) and DSN given an image from the test set of AR.

Table 1: Hoyer's sparseness measures (HSM) on the Extended YaleB and AR databases. We train on 15 (10) samples per category for Extended YaleB (AR) and use the rest for testing. For both databases, the number of hidden units is 500, the group size for S-DSN(sigm) and S-DSN(relu) is 4, and the number of layers is 2. In Extended YaleB, ε = 0.2 and α = 0.5 are used for DSN; ε = 0.2, α = 0.5 and β = 0.001 are used for S-DSN(sigm). In AR, ε = 0.05 and α = 1 are used for DSN; ε = 0.05, α = 1 and β = 0.0001 are used for S-DSN(relu).

                          S-DSN(sigm)           DSN
                 layers   HSM     Acc.(%)       HSM     Acc.(%)
Extended YaleB     1      0.096   91.4          0.010   88.9
                   2      0.111   92.0          0.012   89.4

                          S-DSN(relu)           DSN
                 layers   HSM     Acc.(%)       HSM     Acc.(%)
AR                 1      0.286   93.2          0.003   80.2
                   2      0.306   93.5          0.003   81.2

Figure 1: Activation probabilities of the first hidden layer computed under DSN and S-DSN(relu) on the AR database. Activation probabilities are normalized by dividing by the maximum activation probability.

Results

Face Recognition. Extended YaleB: We randomly select half (32) of the images per category for training and the other half for testing. The parameters are selected as follows: in DSN, ε = 0.1 and α = 0.5; in S-DSN(sigm), ε = 0.1, α = 0.5, G = 2, and β = 0.01; in S-DSN(relu), ε = 0.01, α = 2, G = 5, and β = 0.001. AR: For each person, we randomly select 20 images for training and the other 6 for testing. In our experiments, ε = 0.1 and α = 0.5 are used in DSN; ε = 0.1, α = 0.5, G = 4, and β = 0.001 are used in S-DSN(sigm); ε = 0.01, α = 1, G = 4, and β = 0.001 are used in S-DSN(relu).

We compare the S-DSN with the DSN (Deng and Yu 2011b), LC-KSVD (Jiang, Lin, and Davis 2013) and SRC (Wright et al. 2009) algorithms, which have reported state-of-the-art results on these two databases. The experimental results are summarized in Table 2 and Table 3, respectively. S-DSN(sigm) achieves better results than DSN, LC-KSVD and SRC. From Table 2, S-DSN(sigm)-1 is better than LC-KSVD, with about a 0.2% improvement on Extended YaleB. From Table 3, S-DSN(sigm)-1 and S-DSN(sigm)-2 are also better than LC-KSVD, with about 0.1% and 0.3% improvements on AR, respectively. In addition, we compare with LC-KSVD in terms of the computation time for classifying one test image. The S-DSN has faster inference because it directly learns projection dictionaries. As shown in Table 4, the S-DSN is 7 times faster than LC-KSVD.

Table 2: Recognition Results Using Random-Face Features on the Extended YaleB Database
Methods          Acc.(%)    Methods          Acc.(%)
SRC              97.2       LC-KSVD          96.7
DSN-1            96.6       DSN-2            96.9
S-DSN(sigm)-1    96.9       S-DSN(sigm)-2    97.4
S-DSN(relu)-1    96.1       S-DSN(relu)-2    96.7

Table 3: Recognition Results Using Random-Face Features on the AR Face Database
Methods          Acc.(%)    Methods          Acc.(%)
SRC              97.5       LC-KSVD          97.8
DSN-1            97.6       DSN-2            97.8
S-DSN(sigm)-1    97.9       S-DSN(sigm)-2    98.1
S-DSN(relu)-1    97.6       S-DSN(relu)-2    97.8

Table 4: Inference Time (ms) for a Test Image on the Extended YaleB Database
Methods         SRC       LC-KSVD    S-DSN(relu)
Average time    20.121    0.502      0.069
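The timing gap in Table 4 reflects that classifying a test image with a trained S-DSN needs only one matrix-vector product and one element-wise non-linearity per layer, followed by an arg-max over the top module's output. Below is a minimal sketch for the ReLU variant; the helper name and the arg-max decision rule spelled out here are our reading of "with only a linear classifier", not code from the paper.

```python
import numpy as np

def classify(x, layers):
    """Classify one test feature x (length-D vector) with trained S-DSN layers [(W^k, U^k)]."""
    xk = x
    for W, U in layers:
        h = np.maximum(0.0, W.T @ xk)      # hidden representation, Eq. (4) with ReLU
        y = U.T @ h                        # module output
        xk = np.concatenate([x, y])        # expanded input for the next module
    return int(np.argmax(y))               # label = largest entry of the top module's output
```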
15 Scene Category: Following the common experimental settings, we randomly choose 100 images from each class as training data and use the rest as test data. In our experiments, ε = 20 and α = 0.1 are used in DSN; ε = 20, α = 0.1, G = 4, and β = 0.05 are used in S-DSN(sigm); ε = 15, α = 0.1, G = 4, and β = 0.001 are used in S-DSN(relu).

We compare our results with SRC (Wright et al. 2009), LC-KSVD (Jiang, Lin, and Davis 2013), DeepSC (He et al. 2014), DSN (Deng and Yu 2011b) and other state-of-the-art approaches: ScSPM (Yang et al. 2009), LLC (Wang et al. 2010), ITDL (Qiu, Patel, and Chellappa 2014), ISPR+IFV (Lin et al. 2014), SR-LSR (Li and Guo 2014), DeCAF (Donahue et al. 2014) and DSFL+DeCAF (Zuo et al. 2014). The detailed comparison results are shown in Table 5. Compared to LC-KSVD, S-DSN(relu)-1 performs much better, with a 5.9% improvement. It also registers at least a 1.8% improvement over the deep models DeepSC, DeCAF, DSFL+DeCAF and DSN. As Table 5 shows, S-DSN(relu) performs best among all the compared methods. The confusion matrix for S-DSN(relu) is further shown in Figure 2, from which we can see that the misclassification errors for industrial and store are higher than for the other categories.

Table 5: Recognition Results Using Spatial Pyramid Features on the 15 Scene Category Database
Methods          Acc.(%)    Methods          Acc.(%)
ITDL             81.1       ISPR+IFV         91.1
SR-LSR           85.7       ScSPM            80.3
LLC              89.2       SRC              91.8
LC-KSVD          92.9       DeepSC           83.8
DeCAF            88.0       DSFL+DeCAF       92.8
DSN-1            96.7       DSN-2            97.0
S-DSN(sigm)-1    96.5       S-DSN(sigm)-2    97.1
S-DSN(relu)-1    98.8       S-DSN(relu)-2    98.8

Figure 2: The confusion matrix on the 15 scene category database.

Caltech101: Following the common experimental settings, we train on 5, 10, 15, 20, 25, and 30 samples per category and test on the rest. Due to space limitations, we only give the parameters for 30 training samples per category: ε = 0.2 and α = 0.5 are used in DSN; ε = 0.2, α = 0.5, G = 4, and β = 0.01 are used in S-DSN(sigm); ε = 0.05, α = 0.5, G = 2, and β = 0.001 are used in S-DSN(relu).

We evaluate our approach using spatial pyramid features and compare with SRC (Wright et al. 2009), LC-KSVD (Jiang, Lin, and Davis 2013), DeepSC (He et al. 2014), DSN (Deng and Yu 2011b) and other approaches: ScSPM (Yang et al. 2009), LLC (Wang et al. 2010), LRSC (Zhang et al. 2013) and LCLR (Jiang, Guo, and Peng 2014). The average classification rates are reported in Table 6. From these results, S-DSN(relu)-1 outperforms the other competing dictionary learning approaches, including LC-KSVD, LRSC and SRC, with a 1.6% improvement. S-DSN(relu) also registers about a 1.5% improvement over a deep model, the DSN. Note that the 76.5% accuracy achieved by our method when the number of hidden units is 1000 is also competitive with the 78.4% reported for DeepSC.

Table 6: Recognition Results (%) Using Spatial Pyramid Features on the Caltech101 Database (training samples per category)
Methods          5      10     15     20     25     30
ScSPM            -      -      67.0   -      -      73.2
SRC              48.8   60.1   64.9   67.7   69.2   70.7
LLC              51.2   59.8   65.4   67.7   70.2   73.4
LC-KSVD          54.0   63.1   67.7   70.5   72.3   73.6
LRSC             55.0   63.5   67.1   70.3   72.7   74.4
LCLR             53.4   62.8   67.2   70.8   72.9   74.7
DSN-1            53.6   61.8   67.7   70.2   72.0   74.6
DSN-2            54.9   63.4   68.2   70.5   72.9   74.7
S-DSN(sigm)-1    54.0   62.3   67.6   70.2   72.1   74.7
S-DSN(sigm)-2    55.4   63.7   68.3   70.8   73.2   74.9
S-DSN(relu)-1    55.4   63.8   68.7   71.2   73.5   76.0
S-DSN(relu)-2    55.6   64.2   69.0   71.3   73.6   76.2

Effects of the Number of Hidden Units: We examine how the performance of the proposed S-DSN changes when varying the number of hidden units. We randomly select 30 images per category for training and use the rest as test data. We consider six settings in which the number of hidden units varies from 100 to 3000, and compare the results with the DSN. As reported in Figure 3, our approaches maintain high classification accuracies and outperform the DSN model. As the number of hidden units increases, the accuracy of the system improves for S-DSN(sigm), S-DSN(relu) and DSN.

Figure 3: Effect of the number of hidden units used in S-DSN(sigm), S-DSN(relu) and DSN on recognition accuracy.

Effects of the Number of Layers: The deep framework utilizes multiple layers of feature abstraction to obtain a better representation for images. From Tables 2, 3, 5 and 6, we can check the effect of varying the number of layers: the classification accuracy improves as the number of layers increases.
In addition, compared to the dictionary learning approaches, the S-DSN has faster inference and a deep architecture. Moreover, the S-DSN achieves good performance in image classification.

Conclusion

In this paper, we present an improved DSN model, the S-DSN, for image classification. The S-DSN is constructed by stacking many sparse SNNM modules. In each sparse SNNM module, the lower-layer weights and the upper-layer weights are solved by using convex optimization and a gradient descent algorithm. We use the S-DSN to further extract sparse representations from the random-face features and spatial pyramid features for image classification. Experimental results show that the S-DSN yields very good classification results on four public databases with only a linear classifier.

Acknowledgments

This work was partially supported by the National Science Fund for Distinguished Young Scholars under Grant Nos. 61125305, 61472187, 61233011 and 61373063, the Key Project of the Chinese Ministry of Education under Grant No. 313030, the 973 Program No. 2014CB349303, the Fundamental Research Funds for the Central Universities No. 30920140121005, and the Program for Changjiang Scholars and Innovative Research Team in University No. IRT13072.

References

[Bengio et al. 2009] Bengio, S.; Pereira, F.; Singer, Y.; and Strelow, D. 2009. Group sparse coding. In Proceedings of the Neural Info. Processing Systems, 399–406.
[Deng and Yu 2011a] Deng, L., and Yu, D. 2011a. Accelerated parallelizable neural network learning algorithm for speech recognition. In Proceedings of the Interspeech, 2281–2284.
[Deng and Yu 2011b] Deng, L., and Yu, D. 2011b. Deep convex networks: A scalable architecture for speech pattern classification. In Proceedings of the Interspeech, 2285–2288.
[Deng and Yu 2013] Deng, L., and Yu, D. 2013. Deep learning for signal and information processing. Foundations and Trends in Signal Processing 2-3:197–387.
[Deng, He, and Gao 2013] Deng, L.; He, X.; and Gao, J. 2013. Deep stacking networks for information retrieval. In Proceedings of IEEE Conf. on Acoustics, Speech, and Signal Processing, 3153–3157.
[Deng, Yu, and Platt 2012] Deng, L.; Yu, D.; and Platt, J. 2012. Scalable stacking and learning for building deep architectures. In Proceedings of IEEE Conf. on Acoustics, Speech, and Signal Processing, 2133–2136.
[Donahue et al. 2014] Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2014. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the Int'l Conf. Machine Learning, 647–655.
[Glorot and Bengio 2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Int'l Conf. Artificial Intelligence and Statistics, 249–256.
[Glorot, Bordes, and Bengio 2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the Int'l Conf. Artificial Intelligence and Statistics, 315–323.
[Gregor and LeCun 2010] Gregor, K., and LeCun, Y. 2010. Learning fast approximations of sparse coding. In Proceedings of the Int'l Conf. Machine Learning, 399–406.
[He et al. 2014] He, Y.; Kavukcuoglu, K.; Wang, Y.; Szlam, A.; and Qi, Y. 2014. Unsupervised feature learning by deep sparse coding. In Proceedings of SIAM Int'l Conf. on Data Mining, 902–910.
[Hoyer 2004] Hoyer, P. 2004. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5:1457–1469.
[Huang et al. 2013] Huang, J.; Nie, F.; Huang, H.; and Ding, C. 2013. Supervised and projected sparse coding for image classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 438–444.
[Hutchinson, Deng, and Yu 2013] Hutchinson, B.; Deng, L.; and Yu, D. 2013. Tensor deep stacking networks. IEEE Trans. on Pattern Analysis and Machine Intelligence 35:1944–1957.
[Jiang, Guo, and Peng 2014] Jiang, Z.; Guo, P.; and Peng, L. 2014. Locality-constrained low-rank coding for image classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 2780–2786.
[Jiang, Lin, and Davis 2013] Jiang, Z.; Lin, Z.; and Davis, L. 2013. Label consistent K-SVD: learning a discriminative dictionary for recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 35:2651–2664.
[Lee et al. 2007] Lee, H.; Battle, A.; Raina, R.; and Ng, A. 2007. Efficient sparse coding algorithms. In Proceedings of the Neural Info. Processing Systems, 801–808.
[Lee, Ekanadham, and Ng 2008] Lee, H.; Ekanadham, C.; and Ng, A. 2008. Sparse deep belief net model for visual area V2. In Proceedings of the Neural Info. Processing Systems, 873–880.
[Lennie 2003] Lennie, P. 2003. The cost of cortical computation. Current Biology 13:493–497.
[Li and Guo 2014] Li, X., and Guo, Y. 2014. Latent semantic representation learning for scene classification. In Proceedings of the Int'l Conf. Machine Learning, 532–540.
[Lin et al. 2014] Lin, D.; Lu, C.; Liao, R.; and Jia, J. 2014. Learning important spatial pooling regions for scene classification. In Proceedings of the IEEE Conf. Computer Vision and Pattern Recognition, 3726–3733.
[Luo et al. 2011] Luo, H.; Shen, R.; Niu, C.; and Ullrich, C. 2011. Sparse group restricted Boltzmann machines. In Proceedings of the AAAI Conference on Artificial Intelligence, 429–433.
[Nair and Hinton 2010] Nair, V., and Hinton, G. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the Int'l Conf. Machine Learning, 807–814.
[Qiu, Patel, and Chellappa 2014] Qiu, Q.; Patel, V.; and Chellappa, R. 2014. Information-theoretic dictionary learning for image classification. IEEE Trans. on Pattern Analysis and Machine Intelligence 36:2173–2184.
[Ranzato et al. 2007] Ranzato, M.; Boureau, Y.; Chopra, S.; and LeCun, Y. 2007. Efficient learning of sparse representations with an energy-based model. In Proceedings of the Neural Info. Processing Systems, 1137–1144.
[Ranzato, Boureau, and LeCun 2008] Ranzato, M.; Boureau, Y.; and LeCun, Y. 2008. Sparse feature learning for deep belief networks. In Proceedings of the Neural Info. Processing Systems, 1185–1192.
[Wang et al. 2010] Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; and Gong, Y. 2010. Locality-constrained linear coding for image classification. In Proceedings of the IEEE Conf. Computer Vision and Pattern Recognition, 3360–3367.
[Wolpert 1992] Wolpert, D. 1992. Stacked generalization. Neural Networks 5:241–259.
[Wright et al. 2009] Wright, J.; Yang, M.; Ganesh, A.; Sastry, S.; and Ma, Y. 2009. Robust face recognition via sparse representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 31:210–227.
[Yang et al. 2009] Yang, J.; Yu, K.; Gong, Y.; and Huang, T. 2009. Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, 1794–1801.
[Yang et al. 2012] Yang, J.; Zhang, L.; Xu, Y.; and Yang, J. 2012. Beyond sparsity: The role of l1-optimizer in pattern classification. Pattern Recognition 45:1104–1118.
[Zhang et al. 2013] Zhang, T.; Ghanem, B.; Liu, S.; Xu, C.; and Ahuja, N. 2013. Low-rank sparse coding for image classification. In Proceedings of the IEEE Int'l Conf. Computer Vision, 281–288.
[Zhuang et al. 2013] Zhuang, Y.; Wang, Y.; Wu, F.; Zhang, Y.; and Lu, W. 2013. Supervised coupled dictionary learning with group structures for multi-modal retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, 1070–1076.
[Zuo et al. 2014] Zuo, Z.; Wang, G.; Shuai, B.; Zhao, L.; Yang, Q.; and Jiang, X. 2014. Learning discriminative and shareable features for scene classification. In Proceedings of the Euro. Conf. Computer Vision, 552–568.