Under review as a conference paper at ICLR 2017
arXiv:1701.03281v1 [cs.LG] 12 Jan 2017

MODULARIZED MORPHING OF NEURAL NETWORKS

Tao Wei†                        Changhu Wang                  Chang Wen Chen
University at Buffalo           Microsoft Research            University at Buffalo
Buffalo, NY 14260               Beijing, China, 100080        Buffalo, NY 14260
[email protected]                [email protected]              [email protected]

†Tao Wei performed this work while being an intern at Microsoft Research Asia.

ABSTRACT

In this work we study the problem of network morphism, an effective learning scheme to morph a well-trained neural network to a new one with the network function completely preserved. Different from existing work where basic morphing types on the layer level were addressed, we target the central problem of network morphism at a higher level, i.e., how a convolutional layer can be morphed into an arbitrary module of a neural network. To simplify the representation of a network, we abstract a module as a graph with blobs as vertices and convolutional layers as edges, based on which the morphing process can be formulated as a graph transformation problem. Two atomic morphing operations are introduced to compose the graphs, based on which modules are classified into two families, i.e., simple morphable modules and complex modules. We present practical morphing solutions for both of these two families, and prove that any reasonable module can be morphed from a single convolutional layer. Extensive experiments have been conducted based on the state-of-the-art ResNet on benchmark datasets, and the effectiveness of the proposed solution has been verified.

1 INTRODUCTION

Deep convolutional neural networks have continuously demonstrated their excellent performances on diverse computer vision problems. In image classification, the milestones of such networks can be roughly represented by LeNet (LeCun et al., 1989), AlexNet (Krizhevsky et al., 2012), VGG net (Simonyan & Zisserman, 2014), GoogLeNet (Szegedy et al., 2014), and ResNet (He et al., 2015), with networks becoming deeper and deeper. However, the architectures of these networks are significantly altered and hence are not backward-compatible. Considering a life-long learning system, it is highly desired that the system is able to update itself from the original version established initially, and then evolve into a more powerful one, rather than re-learning a brand new one from scratch.

Network morphism (Wei et al., 2016) is an effective way towards such an ambitious goal. It can morph a well-trained network to a new one with the knowledge entirely inherited, and hence is able to update the original system to a compatible and more powerful one based on further training. Network morphism is also a performance booster and architecture explorer for convolutional neural networks, allowing us to quickly investigate new models with significantly less computational and human resources. However, the network morphism operations proposed in (Wei et al., 2016), including depth, width, and kernel size changes, are quite primitive and have been limited to the layer level of a network. For practical applications, where neural networks usually consist of dozens or even hundreds of layers, the morphing space would be too large for researchers to practically design the architectures of target morphed networks based on these primitive morphing operations only.

Different from previous work, we investigate in this research the network morphism from a higher-level viewpoint, and systematically study the central problem of network morphism on the module level, i.e., whether and how a convolutional layer can be morphed into an arbitrary module¹, where a module refers to a single-source, single-sink acyclic subnet of a neural network.

¹Although network morphism generally does not impose constraints on the architecture of the child network, in this work we limit the investigation to the expanding mode.
With this modularized network morphing, instead of morphing at the layer level, where numerous variations exist in a deep neural network, we focus on the changes of the basic modules of networks, and explore the morphing space in a more efficient way. The necessities for this study are twofold. First, we wish to explore the capability of the network morphism operations and obtain a theoretical upper bound for what we are able to do with this learning scheme. Second, modern state-of-the-art convolutional neural networks have been developed with modularized architectures (Szegedy et al., 2014; He et al., 2015), which stack construction units following the same module design. It is highly desired that the morphing operations could be directly applied to these networks.

To study the morphing capability of network morphism and figure out the morphing process, we introduce a simplified graph-based representation for a module. Thus, the network morphing process can be formulated as a graph transformation process. In this representation, the module of a neural network is abstracted as a directed acyclic graph (DAG), with data blobs in the network represented as vertices and convolutional layers as edges. Furthermore, a vertex with more than one outdegree (or indegree) implicitly includes a split into multiple copies of blobs (or a joint of addition). Indeed, the proposed graph abstraction suffers from the problem of dimension compatibility of blobs, for different kernel filters may result in totally different blob dimensions. We solve this problem by extending the blob and filter dimensions from finite to infinite, and the convergence properties are also carefully investigated.

Two atomic morphing operations are adopted as the basis for the proposed graph transformation, based on which a large family of modules can be transformed from a convolutional layer. This family of modules is called simple morphable modules in this work. A novel algorithm is proposed to identify the morphing steps by reducing the module into a single convolutional layer. For any module outside the simple morphable family, i.e., a complex module, we first apply the same reduction process and reduce it to an irreducible module. A practical algorithm is then proposed to solve the network morphism equation of the irreducible module. Therefore, we not only verify the morphability to an arbitrary module, but also provide a unified morphing solution. This demonstrates the generalization ability, and thus the practicality, of this learning scheme.

Extensive experiments have been conducted based on ResNet (He et al., 2015) to show the effectiveness of the proposed morphing solution. With only 1.2x or less computation, the morphed network can achieve up to 25% relative performance improvement over the original ResNet. Such an improvement is significant in the sense that the morphed 20-layered network is able to achieve an error rate of 6.60%, which is even better than that of a 110-layered ResNet (6.61%) on the CIFAR10 dataset, with only around 1/5 of the computational cost. It is also exciting that the morphed 56-layered network is able to achieve a 5.37% error rate, which is even lower than those of ResNet-110 (6.61%) and ResNet-164 (5.46%). The effectiveness of the proposed learning scheme has also been verified on the CIFAR100 and ImageNet datasets.

2 RELATED WORK

Knowledge Transfer. Network morphism originated from knowledge transfer for convolutional neural networks. Early attempts were only able to transfer partial knowledge of a well-trained network.
For example, a series of model compression techniques (Bucilu et al., 2006; Ba & Caruana, 2014; Hinton et al., 2015; Romero et al., 2014) were proposed to fit a lighter network to predict the output of a heavier network. Pre-training (Simonyan & Zisserman, 2014) was adopted to pre-initialize certain layers of a deeper network with weights learned from a shallower network. However, network morphism requires the knowledge to be fully transferred; existing work includes Net2Net (Chen et al., 2015) and NetMorph (Wei et al., 2016). Net2Net achieved this goal by padding identity mapping layers into the neural network, while NetMorph decomposed a convolutional layer into two layers by deconvolution. Note that the network morphism operations in (Chen et al., 2015; Wei et al., 2016) are quite primitive and at a micro-scale layer level. In this research, we study network morphism at a meso-scale module level, and in particular, we investigate its morphing capability.

Modularized Network Architecture. The evolution of convolutional neural networks has been from sequential to modularized. For example, LeNet (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012), and VGG net (Simonyan & Zisserman, 2014) are sequential networks, and their difference is primarily in the number of layers, which is 5, 8, and up to 19 respectively. However, recently proposed networks, such as GoogLeNet (Szegedy et al., 2014; 2015) and ResNet (He et al., 2015), follow a modularized architecture design and have achieved state-of-the-art performance. This is why we wish to study network morphism at the module level, so that its operations can be directly applied to these modularized network architectures.

Figure 1: Illustration of atomic morphing types. (a) Network morphism in depth: one convolutional layer G_l between blobs B_i and B_j is morphed into two convolutional layers F_l and F_{l+1}; (b) TYPE-I and TYPE-II atomic morphing types.

3 NETWORK MORPHISM VIA GRAPH ABSTRACTION

In this section, we present a systematic study on the capability of the network morphism learning scheme. We shall verify that a convolutional layer is able to be morphed into any single-source, single-sink DAG subnet, named a module here. We shall also present the corresponding morphing algorithms.

For simplicity, we first consider convolutional neural networks with only convolutional layers. All other layers, including the non-linearity and batch normalization layers, will be discussed later in this paper.

3.1 BACKGROUND AND BASIC NOTATIONS

For a 2D deep convolutional neural network (DCNN), as shown in Fig. 1a, the convolution is defined by

    B_j(c_j) = \sum_{c_i} B_i(c_i) * G_l(c_j, c_i),    (1)

where the blob B_* is a 3D tensor of shape (C_*, H_*, W_*) and the convolutional filter G_l is a 4D tensor of shape (C_j, C_i, K_l, K_l). In addition, C_*, H_*, and W_* represent the number of channels, the height, and the width of B_*, and K_l is the convolutional kernel size².

²Generally speaking, G_l is a 4D tensor of shape (C_j, C_i, K_l^H, K_l^W), where the convolutional kernel sizes for blob height and width are not necessarily the same. In order to simplify the notations, we assume that K_l^H = K_l^W, but the claims and theorems in this paper apply equally well when they are different.

In a network morphism process, the convolutional layer G_l in the parent network is morphed into two convolutional layers F_l and F_{l+1} (Fig. 1a), where the filters F_l and F_{l+1} are 4D tensors of shapes (C_l, C_i, K_1, K_1) and (C_j, C_l, K_2, K_2). This process should follow the morphism equation

    \tilde{G}_l(c_j, c_i) = \sum_{c_l} F_l(c_l, c_i) * F_{l+1}(c_j, c_l),    (2)

where G̃_l is a zero-padded version of G_l whose effective kernel size is K̃_l = K_1 + K_2 - 1 >= K_l. (Wei et al., 2016) showed a sufficient condition for exact network morphism:

    \max(C_l C_i K_1^2, C_j C_l K_2^2) \ge C_j C_i (K_1 + K_2 - 1)^2.    (3)
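To make the composition in equation (2) and the size condition (3) concrete, the following is a minimal NumPy sketch (not part of the original paper) that composes two hypothetical filters F_l and F_{l+1} into the effective filter G̃_l and checks condition (3). It assumes the layers implement true convolution, and the shapes in the example are illustrative only.

import numpy as np
from scipy.signal import convolve2d

def compose_filters(F_l, F_lp1):
    """Effective filter of eq. (2): G~_l(c_j, c_i) = sum_{c_l} F_l(c_l, c_i) * F_{l+1}(c_j, c_l).

    F_l   : (C_l, C_i, K1, K1) filter of the first morphed layer.
    F_lp1 : (C_j, C_l, K2, K2) filter of the second morphed layer.
    Returns a (C_j, C_i, K1+K2-1, K1+K2-1) tensor, i.e., the zero-padded G~_l
    with effective kernel size K1 + K2 - 1.
    """
    C_l, C_i, K1, _ = F_l.shape
    C_j, _, K2, _ = F_lp1.shape
    G = np.zeros((C_j, C_i, K1 + K2 - 1, K1 + K2 - 1))
    for cj in range(C_j):
        for ci in range(C_i):
            for cl in range(C_l):
                # spatial convolution of the two kernels along the channel path c_i -> c_l -> c_j
                G[cj, ci] += convolve2d(F_lp1[cj, cl], F_l[cl, ci], mode="full")
    return G

def condition_3(C_i, C_j, C_l, K1, K2):
    """Sufficient condition (3) for an exact TYPE-I morphing."""
    return max(C_l * C_i * K1 ** 2, C_j * C_l * K2 ** 2) >= C_j * C_i * (K1 + K2 - 1) ** 2

# illustrative shapes: a 3x3 layer (64 -> 128) followed by a 1x1 layer (128 -> 64)
F_l = np.random.randn(128, 64, 3, 3)
F_lp1 = np.random.randn(64, 128, 1, 1)
print(compose_filters(F_l, F_lp1).shape)                   # (64, 64, 3, 3)
print(condition_3(C_i=64, C_j=64, C_l=128, K1=3, K2=1))    # True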
For simplicity, we shall denote equations (1) and (2) as B_j = G_l ⊛ B_i and G̃_l = F_{l+1} ⊛ F_l, where ⊛ is a non-commutative multi-channel convolution operator. We can also rewrite equation (3) as max(|F_l|, |F_{l+1}|) >= |G̃_l|, where |*| measures the size of a convolutional filter.

3.2 ATOMIC NETWORK MORPHISM

We start with the simplest cases. Two atomic morphing types are considered, as shown in Fig. 1b: 1) a convolutional layer is morphed into two convolutional layers (TYPE-I); 2) a convolutional layer is morphed into two-way convolutional layers (TYPE-II). For the TYPE-I atomic morphing operation, equation (2) is satisfied, while for TYPE-II, the filter split is set to satisfy

    G_l = F_l^1 + F_l^2.    (4)

In addition, for TYPE-II, at the source end the blob is split into multiple copies, while at the sink end the blobs are joined by addition.

3.3 GRAPH ABSTRACTION

To simplify the representation, we introduce the following graph abstraction for network morphism. A convolutional neural network can be abstracted as a graph, with the blobs represented by vertices and the convolutional layers by edges. Formally, a DCNN is represented as a DAG M = (V, E), where V = {B_i}_{i=1}^N are blobs and E = {e_l = (B_i, B_j)}_{l=1}^L are convolutional layers. Each convolutional layer e_l connects two blobs B_i and B_j, and is associated with a convolutional filter F_l. Furthermore, in this graph, if outdegree(B_i) > 1, it implicitly means a split into multiple copies, and if indegree(B_i) > 1, it is a joint of addition.

Based on this abstraction, we formally introduce the following definition for modular network morphism:

Definition 1. Let M_0 = ({s, t}, e_0) represent the graph with only a single edge e_0 that connects the source vertex s and the sink vertex t. Let M = (V, E) represent any single-source, single-sink DAG with the same source vertex s and the same sink vertex t. We call such an M a module. If there exists a process by which we are able to morph M_0 to M, then we say that module M is morphable, and the morphing process is called modular network morphism.

Hence, based on this abstraction, modular network morphism can be represented as a graph transformation problem. As shown in Fig. 2b, module (C) in Fig. 2a can be transformed from module M_0 by applying the illustrated network morphism operations.

For each modular network morphing, a modular network morphism equation is associated:

Definition 2. Each module essentially corresponds to a function from s to t, which is called a module function. For a modular network morphism process from M_0 to M, the equation that guarantees the module function to be unchanged is called the modular network morphism equation.

It is obvious that equations (2) and (4) are the modular network morphism equations for the TYPE-I and TYPE-II atomic morphing types. In general, the modular network morphism equation for a module M can be written as the sum of all convolutional filter compositions, in which each composition corresponds to a path from s to t in the module M. Let {(F_{p,1}, F_{p,2}, ..., F_{p,i_p}) : p = 1, ..., P, where i_p is the length of path p} be the set of all such paths represented by their convolutional filters. Then the modular network morphism equation for module M can be written as

    G_l = \sum_p F_{p,i_p} \circledast F_{p,i_p-1} \circledast \cdots \circledast F_{p,1}.    (5)
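The sum over paths in equation (5) is easy to mechanize. Below is a small illustrative Python sketch (not from the paper) that enumerates all s-to-t paths of a module given as an edge list; the vertex and filter names are hypothetical. Each printed path corresponds to one composition term F_{p,i_p} ⊛ ... ⊛ F_{p,1} on the right-hand side of equation (5).

from collections import defaultdict

def st_paths(edges, s, t):
    """Enumerate all s -> t paths in a single-source, single-sink DAG.

    edges: list of (u, v, filter_name) tuples, one per convolutional layer.
    Returns a list of paths, each given as the filter names ordered from s to t.
    """
    adj = defaultdict(list)
    for u, v, f in edges:
        adj[u].append((v, f))
    paths = []

    def dfs(u, filters):
        if u == t:
            paths.append(list(filters))
            return
        for v, f in adj[u]:
            dfs(v, filters + [f])

    dfs(s, [])
    return paths

# hypothetical module with vertices {s, a, b, t} and five filters
edges = [("s", "a", "F1"), ("s", "b", "F2"), ("a", "b", "F3"),
         ("a", "t", "F4"), ("b", "t", "F5")]
for path in st_paths(edges, "s", "t"):
    # composition order in eq. (5) runs from the sink back to the source
    print(" (*) ".join(reversed(path)))
# prints: F4 (*) F1, F5 (*) F3 (*) F1, F5 (*) F2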
As an example illustrated in Fig. 2a, there are four paths in module (D), and its modular network morphism equation can be written as

    G_l = F_5 \circledast F_1 + F_6 \circledast (F_3 \circledast F_1 + F_4 \circledast F_2) + F_7 \circledast F_2,    (6)

where G_l is the convolutional filter associated with e_0 in module M_0.

Figure 2: Example modules and morphing processes. (a) Example modules (A)-(D): modules (A)-(C) are simple morphable, while (D) is not; (b) a morphing process for module (C), while for module (D) we are not able to find such a process.

3.4 THE COMPATIBILITY OF THE NETWORK MORPHISM EQUATION

One difficulty in this graph abstraction lies in the dimensional compatibility of the convolutional filters or blobs. For example, for the TYPE-II atomic morphing in Fig. 1b, we have to satisfy G_l = F_l^1 + F_l^2. Suppose that G_l and F_l^2 are of shape (64, 64, 3, 3) while F_l^1 is of shape (64, 64, 1, 1); they are then actually not addable. Formally, we define the compatibility of a modular network morphism equation as follows:

Definition 3. The modular network morphism equation for a module M is compatible if and only if the mathematical operators between the convolutional filters involved in this equation are well-defined.

In order to solve this compatibility problem, we need not assume that the blobs {B_i} and filters {F_l} are finite dimension tensors. Instead, they are considered as infinite dimension tensors defined with a finite support³, and we call this the extended definition. An instant advantage of adopting this extended definition is that we no longer need to differentiate G_l and G̃_l in equation (2), since G̃_l is simply a zero-padded version of G_l.

³A support of a function is defined as the set of points where the function value is non-zero, i.e., support(f) = {x | f(x) ≠ 0}.

Lemma 4. The operations + and ⊛ are well-defined for the modular network morphism equation. Namely, if F^1 and F^2 are infinite dimension 4D tensors with finite support, and we let G = F^1 + F^2 and H = F^2 ⊛ F^1, then both G and H are uniquely defined and also have finite support.

Sketch of Proof. It is quite obvious that this lemma holds for the operator +. For the operator ⊛, under the extended definition the sum in equation (2) becomes infinite over the index c_l. It is straightforward to show that this infinite sum converges, and also that H is finitely supported with respect to the indices c_j and c_i. Hence H has finite support.

As a corollary, we have:

Corollary 5. The modular network morphism equation for any module M is always compatible if the filters involved in M are considered as infinite dimension tensors with finite support.
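In practice, the finite-support view simply means that a smaller kernel can be zero-padded to the spatial support of a larger one before the operators are applied. The following sketch (ours, with illustrative shapes) makes the TYPE-II example from Section 3.4 addable in the sense of equation (4).

import numpy as np

def pad_kernel(F, K):
    """Center a (C_out, C_in, k, k) filter inside a K x K spatial support (K >= k, K - k even).

    Entries outside the original k x k window are zero, which is exactly the
    finite-support extension used to make the morphism equation compatible.
    """
    C_out, C_in, k, _ = F.shape
    pad = (K - k) // 2
    out = np.zeros((C_out, C_in, K, K), dtype=F.dtype)
    out[:, :, pad:pad + k, pad:pad + k] = F
    return out

# TYPE-II example from Section 3.4: G_l and F_l^2 are 3x3, while F_l^1 is 1x1
F1 = np.random.randn(64, 64, 1, 1)
F2 = np.random.randn(64, 64, 3, 3)
G_l = pad_kernel(F1, 3) + F2      # equation (4) is now well-defined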
3.5 SIMPLE MORPHABLE MODULES

In this section, we introduce a large family of modules, i.e., simple morphable modules, and then provide their morphing solutions. We first introduce the following definition:

Definition 6. A module M is simple morphable if and only if it is able to be morphed with only combinations of atomic morphing operations.

Several example modules are shown in Fig. 2a. It is obvious that modules (A)-(C) are simple morphable, and the morphing process for module (C) is also illustrated in Fig. 2b.

For a simple morphable module M, we are able to identify a morphing sequence from M_0 to M. The algorithm is illustrated in Algorithm 1. The core idea is to use the reverse operations of the atomic morphing types to reduce M to M_0; the morphing process is then just the reverse of the reduction process. In Algorithm 1, we use a four-element tuple (M, e_1, {e_2, e_3}, type) to represent the process of morphing edge e_1 in module M into {e_2, e_3} using a TYPE-<type> atomic operation. Two auxiliary functions, CHECKTYPEI and CHECKTYPEII, are further introduced. Both of them return either FALSE, if there is no such atomic sub-module in M, or a morphing tuple (M, e_1, {e_2, e_3}, type) if there is. CHECKTYPEI only needs to find a vertex satisfying indegree(B_i) = outdegree(B_i) = 1, while CHECKTYPEII looks for elements > 1 in the adjacency matrix representation of module M.

Algorithm 1: Algorithm for Simple Morphable Modules
Input: M_0; a simple morphable module M
Output: the morphing sequence Q that morphs M_0 to M using atomic morphing operations
  Q = ∅
  while M ≠ M_0 do
    while CHECKTYPEI(M) is not FALSE do
      // let (M_temp, e_1, {e_2, e_3}, type) be the return value of CHECKTYPEI(M)
      Q.prepend((M_temp, e_1, {e_2, e_3}, type)) and M ← M_temp
    end while
    while CHECKTYPEII(M) is not FALSE do
      // let (M_temp, e_1, {e_2, e_3}, type) be the return value of CHECKTYPEII(M)
      Q.prepend((M_temp, e_1, {e_2, e_3}, type)) and M ← M_temp
    end while
  end while

Is there a module that is not simple morphable? The answer is yes, and an example is module (D) in Fig. 2a. A simple try does not work, as shown in Fig. 2b. In fact, we have the following proposition:

Proposition 7. Module (D) in Fig. 2a is not simple morphable.

Sketch of Proof. A simple morphable module M is always able to be reverted back to M_0. However, for module (D) in Fig. 2a, both CHECKTYPEI and CHECKTYPEII return FALSE.

3.6 MODULAR NETWORK MORPHISM THEOREM

For a module that is not simple morphable, which is called a complex module, we are able to apply Algorithm 1 to first reduce it to an irreducible module M. For M, we propose Algorithm 2 to solve the modular network morphism equation. The core idea of this algorithm is that, if only one convolutional filter is allowed to change with all others fixed, the modular network morphism equation reduces to a linear system. The argument below guarantees the correctness of Algorithm 2.

Algorithm 2: Algorithm for Irreducible Modules
Input: G_l; an irreducible module M
Output: convolutional filters {F_i}_{i=1}^n of M
  Initialize {F_i}_{i=1}^n with random noise.
  Calculate the effective kernel size of M, and expand G_l to G̃_l by padding zeros.
  repeat
    for j = 1 to n do
      Fix {F_i : i ≠ j}, and calculate F_j = deconv(G̃_l, {F_i : i ≠ j})
      Calculate the loss l = ||G̃_l − conv({F_i}_{i=1}^n)||²
    end for
  until l = 0 or maxIter is reached

Correctness of Algorithm 2. Let G_l and {F_i}_{i=1}^n be the convolutional filter(s) associated with M_0 and M. We further assume that one of the {F_i}, e.g., F_j, is larger than or equal in size to G̃_l, where G̃_l is the zero-padded version of G_l (this assumption is a strong condition in the expanding mode). The modular network morphism equation for M can be written as

    \tilde{G}_l = C_1 \circledast F_j \circledast C_2 + C_3,    (7)

where C_1, C_2, and C_3 are composed of the other filters {F_i : i ≠ j}. It can be checked that equation (7) is a linear system with |G̃_l| constraints and |F_j| free variables. Since we have |F_j| >= |G̃_l|, the system is under-determined and hence solvable, as random matrices are rarely inconsistent.
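Algorithm 2 leaves the deconv step abstract. Since the module function is affine in the single free filter F_j once the others are fixed, one generic way to realize that step, sketched below under our own assumptions rather than as the authors' implementation, is to probe the affine map numerically and solve a linear least-squares problem. Here module_filter stands for any user-supplied routine that composes a list of filters into the module's effective filter (e.g., via equation (5)), and the dense probing is deliberately naive for clarity.

import numpy as np

def solve_free_filter(module_filter, filters, j, G_target):
    """One 'deconv' step of Algorithm 2: solve for filters[j] with all other filters fixed.

    module_filter(filters) must return the module's composed filter (same shape as
    G_target) and be affine in each individual filter, as is the case for sums of
    compositions such as eq. (5). G_target is the zero-padded G~_l.
    """
    shape = filters[j].shape
    n = int(np.prod(shape))
    zeroed = [f if i != j else np.zeros(shape) for i, f in enumerate(filters)]
    b = module_filter(zeroed).ravel()             # constant part of the affine map
    A = np.empty((G_target.size, n))
    for k in range(n):                            # probe unit basis filters to build A
        e = np.zeros(n)
        e[k] = 1.0
        trial = [f if i != j else e.reshape(shape) for i, f in enumerate(filters)]
        A[:, k] = module_filter(trial).ravel() - b
    sol, *_ = np.linalg.lstsq(A, G_target.ravel() - b, rcond=None)
    return sol.reshape(shape)

def algorithm_2(module_filter, filters, G_target, max_iter=20, tol=1e-10):
    """Alternately re-solve each filter until the morphism equation residual vanishes."""
    for _ in range(max_iter):
        for j in range(len(filters)):
            filters[j] = solve_free_filter(module_filter, filters, j, G_target)
        loss = float(np.sum((G_target - module_filter(filters)) ** 2))
        if loss < tol:
            break
    return filters, loss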
For a general module M, whether simple morphable or not, we apply Algorithm 1 to reduce M to an irreducible module M′, and then apply Algorithm 2 to M′. Hence we have the following theorem:

Theorem 8. A convolutional layer can be morphed into any module (any single-source, single-sink DAG subnet).

This theorem answers the core question of network morphism, and provides a theoretical upper bound for the capability of this learning scheme.

3.7 NON-LINEARITY AND BATCH NORMALIZATION IN MODULAR NETWORK MORPHISM

Besides the convolutional layers, a neural network module typically also involves non-linearity layers and batch normalization layers, as illustrated in Fig. 3. In this section, we describe how we handle these layers for modular network morphism.

Figure 3: Detailed architectures of the ResNet module and the morph_1c1 module. (a) ResNet module: Input → 3x3 Conv → BN → ReLU → 3x3 Conv → BN, plus an identity (Id) shortcut, joined by addition and followed by ReLU. (b) morph_1c1 module: the same residual branch, a 0.5·Id shortcut, and an additional Input → 1x1 Conv → BN → PReLU → 1x1 Conv → BN branch, joined by addition and followed by ReLU.

Figure 4: Sample modules adopted in the proposed experiments: (a) ResNet, (b) morph_1c1, (c) morph_3c1, (d) morph_3c3, (e) morph_1c1_2branch. (a) and (b) are the graph abstractions of the modules illustrated in Fig. 3(a) and (b).

For the non-linear activation layers, we adopt the solution proposed in (Wei et al., 2016). Instead of directly applying the non-linear activations, we use their parametric forms. Let φ be any non-linear activation function; its parametric form is defined to be

    \text{P-}\varphi = \{\varphi^a\}_{a \in [0,1]} = \{(1-a) \cdot \varphi + a \cdot \varphi_{id}\}_{a \in [0,1]}.    (8)

The shape of the parametric form of the non-linear activation φ is controlled by the parameter a. When a is initialized (a = 1), the parametric form is equivalent to an identity function, and when the value of a has been learned, the parametric form becomes a non-linear activation. In Fig. 3b, the non-linear activation for the morphing process is annotated as PReLU to differentiate it from the other ReLU activations. In the proposed experiments, for simplicity, all ReLUs are replaced with PReLUs.

Table 1: Experimental results of networks morphed from ResNet-20, ResNet-56, and ResNet-110 on the CIFAR10 dataset. Results annotated with † are from (He et al., 2015).

Net Arch.            | Intermediate Phases | Error      | Abs. Perf. Improv. | Rel. Perf. Improv. | #Params. (MB) | #Params. Rel. | FLOP (million) | Rel. FLOP
resnet20†            | -                   | 8.75%      | -                  | -                  | 1.048         | 1x            | 40.8           | 1x
morph20_1c1          | -                   | 7.35%      | 1.40%              | 16.0%              | 1.138         | 1.09x         | 44.0           | 1.08x
morph20_3c1          | -                   | 7.10%      | 1.65%              | 18.9%              | 1.466         | 1.40x         | 56.5           | 1.38x
morph20_3c1          | 1c1                 | 6.83%      | 1.92%              | 21.9%              | 1.466         | 1.40x         | 56.5           | 1.38x
morph20_3c3          | -                   | 6.97%      | 1.78%              | 20.3%              | 1.794         | 1.71x         | 69.1           | 1.69x
morph20_3c3          | 1c1, 3c1            | 6.66%      | 2.09%              | 23.9%              | 1.794         | 1.71x         | 69.1           | 1.69x
morph20_1c1_2branch  | -                   | 7.26%      | 1.49%              | 17.0%              | 1.227         | 1.17x         | 47.1           | 1.15x
morph20_1c1_2branch  | 1c1, half           | 6.60%      | 2.15%              | 24.6%              | 1.227         | 1.17x         | 47.1           | 1.15x
resnet56†            | -                   | 6.97%      | -                  | -                  | 3.289         | 1x            | 125.7          | 1x
morph56_1c1_half     | -                   | 5.68%      | 1.29%              | 18.5%              | 3.468         | 1.05x         | 132.0          | 1.05x
morph56_1c1          | -                   | 5.94%      | 1.03%              | 14.8%              | 3.647         | 1.11x         | 138.3          | 1.10x
morph56_1c1          | 1c1_half            | 5.37%      | 1.60%              | 23.0%              | 3.647         | 1.11x         | 138.3          | 1.10x
resnet110†           | -                   | 6.61%±0.16 | -                  | -                  | 6.649         | 1x            | 253.1          | 1x
morph110_1c1_half    | -                   | 5.74%      | 0.87%              | 13.2%              | 7.053         | 1.06x         | 267.3          | 1.06x
morph110_1c1         | -                   | 5.93%      | 0.68%              | 10.3%              | 7.412         | 1.11x         | 279.9          | 1.11x
morph110_1c1         | 1c1_half            | 5.50%      | 1.11%              | 16.8%              | 7.412         | 1.11x         | 279.9          | 1.11x

Table 2: Comparison results between learning from morphing and learning from scratch for the same network architectures on the CIFAR10 dataset.

Net Arch.            | Error (scratch) | Error (morph) | Abs. Perf. Improv. | Rel. Perf. Improv.
morph20_1c1          | 8.01%           | 7.35%         | 0.66%              | 8.2%
morph20_1c1_2branch  | 7.90%           | 6.60%         | 1.30%              | 16.5%
morph56_1c1          | 7.37%           | 5.37%         | 2.00%              | 27.1%
morph110_1c1         | 8.16%           | 5.50%         | 2.66%              | 32.6%

The batch normalization layers (Ioffe & Szegedy, 2015) can be represented as

    \text{newdata} = \frac{\text{data} - \text{mean}}{\sqrt{\text{var} + \text{eps}}} \cdot \text{gamma} + \text{beta}.    (9)

It is obvious that if we set gamma = sqrt(var + eps) and beta = mean, then a batch normalization layer is reduced to an identity mapping layer, and hence it can be inserted anywhere in the network.
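As a sanity check on equation (9) and the identity-mapping initialization just described, here is a small NumPy sketch (ours; the batch shape is illustrative): with gamma = sqrt(var + eps) and beta = mean, the batch normalization layer leaves its input unchanged.

import numpy as np

def batch_norm(data, gamma, beta, eps=1e-5):
    """Eq. (9): newdata = (data - mean) / sqrt(var + eps) * gamma + beta,
    with per-channel statistics over the (N, H, W) axes of an (N, C, H, W) batch."""
    mean = data.mean(axis=(0, 2, 3), keepdims=True)
    var = data.var(axis=(0, 2, 3), keepdims=True)
    return (data - mean) / np.sqrt(var + eps) * gamma + beta

x = np.random.randn(8, 16, 32, 32)                  # illustrative batch of feature maps
mean = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)

y = batch_norm(x, gamma=np.sqrt(var + 1e-5), beta=mean)
print(np.allclose(x, y))                            # True: the layer acts as an identity mapping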
Although it is possible to calculate the values of gamma and beta from the training data, in this research we adopt another, simpler approach by setting gamma = 1 and beta = 0. In fact, the value of gamma can be set to any nonzero number, since the scale is then normalized by the latter batch normalization layer (the lower right one in Fig. 3b). Mathematically and strictly speaking, when we set gamma = 1 and beta = 0, the network function is actually changed. However, since the morphed filters for the convolutional layers are roughly randomized, even though the mean of the data is not strictly zero, it is still approximately zero. Together with the fact that the data is then normalized by the latter batch normalization layer, such a small perturbation of the network function can be neglected. In the proposed experiments, only statistical variances in performance are observed for the morphed network when we adopt this setting. The reason we prefer this approach to using the training data is that it is easier to implement and also yields slightly better results when we continue to train the morphed network.

4 EXPERIMENTAL RESULTS

In this section, we report the results of the proposed morphing algorithms based on the current state-of-the-art ResNet (He et al., 2015), which is the winner of the 2015 ImageNet classification task.

Table 3: Experimental results of networks morphed from ResNet-20, ResNet-56, and ResNet-110 on the CIFAR100 dataset.

Net Arch.     | Intermediate Phases | Error  | Abs. Perf. Improv. | Rel. Perf. Improv. | #Params. (MB) | #Params. Rel. | FLOP (million) | Rel. FLOP
resnet20      | -                   | 32.82% | -                  | -                  | 1.070         | 1x            | 40.8           | 1x
morph20_1c1   | -                   | 31.70% | 1.12%              | 3.4%               | 1.160         | 1.08x         | 44.0           | 1.08x
resnet56      | -                   | 29.83% | -                  | -                  | 3.311         | 1x            | 125.8          | 1x
morph56_1c1   | 1c1_half            | 27.52% | 2.31%              | 7.7%               | 3.670         | 1.11x         | 138.3          | 1.10x
resnet110     | -                   | 28.46% | -                  | -                  | 6.672         | 1x            | 253.2          | 1x
morph110_1c1  | 1c1_half            | 26.81% | 1.65%              | 5.8%               | 7.434         | 1.11x         | 279.9          | 1.11x

Table 4: Comparison results between learning from morphing and learning from scratch for the same network architectures on the CIFAR100 dataset.

Net Arch.     | Error (scratch) | Error (morph) | Abs. Perf. Improv. | Rel. Perf. Improv.
morph20_1c1   | 33.63%          | 31.70%        | 1.93%              | 5.7%
morph56_1c1   | 32.58%          | 27.52%        | 5.06%              | 15.5%
morph110_1c1  | 31.94%          | 26.81%        | 5.13%              | 16.1%

4.1 NETWORK ARCHITECTURES OF MODULAR NETWORK MORPHISM

We first introduce the network architectures used in the proposed experiments. Fig. 3a shows the module template in the design of ResNet (He et al., 2015), which is actually a simple morphable two-way module. The first path consists of two convolutional layers, and the second path is a shortcut connection of identity mapping. The architecture of the ResNet module can be abstracted as the graph in Fig. 4a. For the morphed networks, we first split the identity mapping layer in the ResNet module into two layers with a scaling factor of 0.5. Each of the scaled identity mapping layers can then be further morphed into two convolutional layers. Fig. 3b illustrates the case with only one scaled identity mapping layer morphed into two convolutional layers, and its equivalent graph abstraction is shown in Fig. 4b. To differentiate the network architectures adopted in this research, the notation morph_<k1>c<k2> is introduced, where k1 and k2 are the kernel sizes in the morphed network. If both scaled identity mapping branches are morphed, we append a suffix of '_2branch'. Some examples of morphed modules are illustrated in Fig. 4. We also use the suffix '_half' to indicate that only one half (the odd-indexed ones) of the modules are morphed, and the other half are left as original ResNet modules.
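For concreteness, below is a minimal PyTorch sketch of the morph_1c1 module of Fig. 3b as described above: the original residual branch, a 0.5-scaled identity shortcut, and a morphed 1x1-1x1 branch with a parametric activation. The class name and the assumptions of equal input/output width and stride 1 are ours; in the paper's scheme the morphed branch's filters and batch normalization parameters would additionally be initialized by the morphing algorithms so that the block initially computes the same function as the parent ResNet block, which the default weight initialization below does not do.

import torch.nn as nn

class Morph1c1Block(nn.Module):
    """Sketch of the morph_1c1 module (Fig. 3b): residual branch + 0.5 * identity
    + a morphed 1x1 -> 1x1 branch, joined by addition and followed by ReLU.
    Assumes equal input/output channels and stride 1."""

    def __init__(self, channels):
        super().__init__()
        # original ResNet branch: 3x3 Conv - BN - ReLU - 3x3 Conv - BN
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # morphed branch: 1x1 Conv - BN - PReLU - 1x1 Conv - BN;
        # PReLU realizes the parametric ReLU of eq. (8), initialized as identity (a = 1)
        self.morphed = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.PReLU(init=1.0),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # one half of the split identity shortcut stays as 0.5 * x,
        # the other half has been morphed into the 1x1 -> 1x1 branch
        return self.relu(self.residual(x) + 0.5 * x + self.morphed(x))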
4.2 EXPERIMENTAL RESULTS ON THE CIFAR10 DATASET

CIFAR10 (Krizhevsky & Hinton, 2009) is a benchmark dataset for image classification and neural network investigation. It consists of 32x32 color images in 10 categories, with 50,000 training images and 10,000 testing images. In the training process, we follow the same setup as in (He et al., 2015): we use a weight decay of 0.0001 and a momentum of 0.9, and we adopt simple data augmentation with a pad of 4 pixels on each side of the original image. A 32x32 view is randomly cropped from the padded image, and a random horizontal flip is optionally applied.

Table 1 shows the results of the different networks morphed from ResNet (He et al., 2015). Notice that it is very challenging to further improve the performance, as ResNet has already pushed the numbers to a very high level; e.g., ResNet (He et al., 2015) gained only a 0.36% performance improvement by extending the model from 56 to 110 layers (Table 1). From Table 1 we can see that, with only 1.2x or less computational cost, the morphed networks achieve 2.15%, 1.60%, and 1.11% performance improvements over the original ResNet-20, ResNet-56, and ResNet-110, respectively. Notice that the relative performance improvement can be up to 25%. Table 1 also compares the number of parameters of the original network architectures and those after morphing. As can be seen, the morphed ones have only slightly more parameters than the original ones, typically less than 1.2x.

Besides the large error rate reduction achieved by the morphed networks, one exciting indication from Table 1 is that the morphed 20-layered network morph20_1c1_2branch is able to achieve a slightly lower error rate than the 110-layered ResNet (6.60% vs. 6.61%), while its computational cost is actually less than 1/5 of the latter. Similar results have also been observed for the morphed 56-layered network: it is able to achieve a 5.37% error rate, which is even lower than those of ResNet-110 (6.61%) and ResNet-164 (5.46%) (He et al., 2016). These results are also illustrated in Fig. 5(a).

Figure 5: Comparison results of ResNet and the morphed networks on the CIFAR10 and CIFAR100 datasets: error rates (%) of the 20-, 56-, and 110-layer models on (a) CIFAR10 and (b) CIFAR100.

Several different architectures of the morphed networks were also explored, as illustrated in Fig. 4 and Table 1. First, when the kernel sizes were expanded from 1x1 to 3x3, the morphed networks (morph20_3c1 and morph20_3c3) achieved better performances. Similar results were reported in (Simonyan & Zisserman, 2014) (Table 1 for models C and D). However, because these morphed networks almost double the computational cost, we did not adopt this approach. Second, we also tried to morph the other scaled identity mapping layer into two convolutional layers (morph20_1c1_2branch); the error rate was further lowered for the 20-layered network. However, for the 56-layered and 110-layered networks, this strategy did not yield better results.

We also found that a morphed network learned with multiple phases could achieve a lower error rate than one learned with a single phase. For example, the networks morph20_3c1 and morph20_3c3 learned with intermediate phases achieved better results in Table 1. This is quite reasonable, as it divides the optimization problem into sequential phases and thus makes it possible to avoid being trapped in a local minimum to some extent. Inspired by this observation, we then used a 1c1_half network as an intermediate phase for the morph56_1c1 and morph110_1c1 networks, and better results were achieved.

Finally, we compared the proposed learning scheme against learning from scratch for networks with the same architectures. These results are illustrated in Table 2. As can be seen, networks learned by morphing are able to achieve up to a 2.66% absolute and a 32.6% relative performance improvement over learning from scratch, for the morph110_1c1 network architecture. These results are quite reasonable: when networks are learned by the proposed morphing scheme, they have already been regularized, and thus have a lower probability of being trapped into a bad-performing local minimum during the continual training process than under the learning-from-scratch scheme. One may also notice that morph110_1c1 actually performed worse than resnet110 when learned from scratch.
This is because the network architecture morph_1c1 is proposed for morphing, and the identity shortcut connection is scaled with a factor of 0.5. It was also reported that residual networks with a constant scaling factor of 0.5 actually led to a worse performance in (He et al., 2016) (Table 1), while this performance degradation problem could be avoided by the proposed morphing scheme.

4.3 EXPERIMENTAL RESULTS ON THE CIFAR100 DATASET

CIFAR100 (Krizhevsky & Hinton, 2009) is another benchmark dataset of tiny images that consists of 100 categories. There are 500 training images and 100 testing images per category. The proposed