MODULARIZED MORPHING OF NEURAL NETWORKS

Tao Wei†                    Changhu Wang              Chang Wen Chen
University at Buffalo       Microsoft Research        University at Buffalo
Buffalo, NY 14260           Beijing, China, 100080    Buffalo, NY 14260
taowei@buffalo.edu          chw@microsoft.com         chencw@buffalo.edu
ABSTRACT
In this work we study the problem of network morphism, an effective learning scheme that morphs a well-trained neural network into a new one with the network function completely preserved. Different from existing work, where basic morphing types at the layer level were addressed, we target the central problem of network morphism at a higher level, i.e., how a convolutional layer can be morphed into an arbitrary module of a neural network. To simplify the representation of a network, we abstract a module as a graph with blobs as vertices and convolutional layers as edges, based on which the morphing process can be formulated as a graph transformation problem. Two atomic morphing operations are introduced to compose the graphs, based on which modules are classified into two families, i.e., simple morphable modules and complex modules. We present practical morphing solutions for both families, and prove that any reasonable module can be morphed from a single convolutional layer. Extensive experiments have been conducted based on the state-of-the-art ResNet on benchmark datasets, and the effectiveness of the proposed solution has been verified.
1 INTRODUCTION

Deep convolutional neural networks have continuously demonstrated their excellent performances on diverse computer vision problems. In image classification, the milestones of such networks can be roughly represented by LeNet (LeCun et al., 1989), AlexNet (Krizhevsky et al., 2012), VGG net (Simonyan & Zisserman, 2014), GoogLeNet (Szegedy et al., 2014), and ResNet (He et al., 2015), with networks becoming deeper and deeper. However, the architectures of these networks are significantly altered and hence are not backward-compatible. Considering a life-long learning system, it is highly desired that the system is able to update itself from the original version established initially, and then evolve into a more powerful one, rather than re-learning a brand new one from scratch.
Network morphism (Wei et al., 2016) is an effective way towards such an ambitious goal. It can morph a well-trained network into a new one with the knowledge entirely inherited, and hence is able to update the original system to a compatible and more powerful one based on further training. Network morphism is also a performance booster and architecture explorer for convolutional neural networks, allowing us to quickly investigate new models with significantly less computational and human resources. However, the network morphism operations proposed in (Wei et al., 2016), including depth, width, and kernel size changes, are quite primitive and limited to the level of a single layer in a network. For practical applications, where neural networks usually consist of dozens or even hundreds of layers, the morphing space would be too large for researchers to practically design the architectures of target morphed networks based on these primitive morphing operations alone.
Different from previous work, in this research we investigate network morphism from a higher-level viewpoint, and systematically study the central problem of network morphism at the module level, i.e., whether and how a convolutional layer can be morphed into an arbitrary module¹, where a module refers to a single-source, single-sink acyclic subnet of a neural network.
†Tao Wei performed this work while being an intern at Microsoft Research Asia.
¹Although network morphism generally does not impose constraints on the architecture of the child network, in this work we limit the investigation to the expanding mode.
With this modularized network morphing, instead of morphing at the layer level, where numerous variations exist in a deep neural network, we focus on changes of the basic modules of networks, and explore the morphing space in a more efficient way. The necessities for this study are twofold. First, we wish to explore the capability of the network morphism operations and obtain a theoretical upper bound for what we are able to do with this learning scheme. Second, modern state-of-the-art convolutional neural networks have been developed with modularized architectures (Szegedy et al., 2014; He et al., 2015), which stack construction units following the same module design. It is highly desired that the morphing operations could be directly applied to these networks.
To study the morphing capability of network morphism and figure out the morphing process, we introduce a simplified graph-based representation for a module. Thus, the network morphing process can be formulated as a graph transformation process. In this representation, the module of a neural network is abstracted as a directed acyclic graph (DAG), with data blobs in the network represented as vertices and convolutional layers as edges. Furthermore, a vertex with more than one outdegree (or indegree) implicitly includes a split into multiple copies of blobs (or a joint of addition). The proposed graph abstraction does suffer from the problem of dimension compatibility of blobs, since different kernel filters may result in totally different blob dimensions. We solve this problem by extending the blob and filter dimensions from finite to infinite, and the convergence properties are also carefully investigated.
Two atomic morphing operations are adopted as the basis for the proposed graph transformation, based on which a large family of modules can be transformed from a convolutional layer. This family of modules is called simple morphable modules in this work. A novel algorithm is proposed to identify the morphing steps by reducing the module to a single convolutional layer. For any module outside the simple morphable family, i.e., a complex module, we first apply the same reduction process and reduce it to an irreducible module. A practical algorithm is then proposed to solve the network morphism equation of the irreducible module. Therefore, we not only verify the morphability to an arbitrary module, but also provide a unified morphing solution. This demonstrates the generalization ability and thus the practicality of this learning scheme.
Extensive experiments have been conducted based on ResNet (He et al., 2015) to show the effectiveness of the proposed morphing solution. With only 1.2x or less computation, the morphed network can achieve up to 25% relative performance improvement over the original ResNet. Such an improvement is significant in the sense that the morphed 20-layered network is able to achieve an error rate of 6.60%, which is even better than a 110-layered ResNet (6.61%) on the CIFAR10 dataset, with only around 1/5 of the computational cost. It is also exciting that the morphed 56-layered network is able to achieve a 5.37% error rate, which is even lower than those of ResNet-110 (6.61%) and ResNet-164 (5.46%). The effectiveness of the proposed learning scheme has also been verified on the CIFAR100 and ImageNet datasets.
2 RELATED WORK
Knowledge Transfer. Network morphism originated from knowledge transfer for convolutional neural networks. Early attempts were only able to transfer partial knowledge of a well-trained network. For example, a series of model compression techniques (Buciluǎ et al., 2006; Ba & Caruana, 2014; Hinton et al., 2015; Romero et al., 2014) were proposed to fit a lighter network to predict the output of a heavier network. Pre-training (Simonyan & Zisserman, 2014) was adopted to pre-initialize certain layers of a deeper network with weights learned from a shallower network. However, network morphism requires the knowledge to be fully transferred, and existing work includes Net2Net (Chen et al., 2015) and NetMorph (Wei et al., 2016). Net2Net achieved this goal by padding identity mapping layers into the neural network, while NetMorph decomposed a convolutional layer into two layers by deconvolution. Note that the network morphism operations in (Chen et al., 2015; Wei et al., 2016) are quite primitive and at a micro-scale layer level. In this research, we study network morphism at a meso-scale module level, and in particular, we investigate its morphing capability.
Modularized Network Architecture. The evolution of convolutional neural networks has been from sequential to modularized. For example, LeNet (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012), and VGG net (Simonyan & Zisserman, 2014) are sequential networks, and their difference is primarily in the number of layers, which is 5, 8, and up to 19, respectively.
Figure 1: Illustration of atomic morphing types. (a) Network morphism in depth: one convolutional layer is morphed into two convolutional layers; (b) the TYPE-I and TYPE-II atomic morphing types.
However, recently proposed networks, such as GoogLeNet (Szegedy et al., 2014; 2015) and ResNet (He et al., 2015), follow a modularized architecture design and have achieved state-of-the-art performance. This is why we wish to study network morphism at the module level, so that its operations can be directly applied to these modularized network architectures.
3 NETWORK MORPHISM VIA GRAPH ABSTRACTION

In this section, we present a systematic study on the capability of the network morphism learning scheme. We shall verify that a convolutional layer is able to be morphed into any single-source, single-sink DAG subnet, named a module here. We shall also present the corresponding morphing algorithms.
For simplicity, we first consider convolutional neural networks with only convolutional layers. All other layers, including the non-linearity and batch normalization layers, will be discussed later in this paper.
3.1 BACKGROUND AND BASIC NOTATIONS
For a 2D deep convolutional neural network (DCNN), as shown in Fig. 1a, the convolution is defined by:

    B_j(c_j) = \sum_{c_i} B_i(c_i) * G_l(c_j, c_i),    (1)

where the blob B_* is a 3D tensor of shape (C_*, H_*, W_*), and the convolutional filter G_l is a 4D tensor of shape (C_j, C_i, K_l, K_l). In addition, C_*, H_*, and W_* represent the number of channels, height, and width of B_*, and K_l is the convolutional kernel size².
In a network morphism process, the convolutional layer G_l in the parent network is morphed into two convolutional layers F_l and F_{l+1} (Fig. 1a), where the filters F_l and F_{l+1} are 4D tensors of shapes (C_l, C_i, K_1, K_1) and (C_j, C_l, K_2, K_2). This process should follow the morphism equation:

    \tilde{G}_l(c_j, c_i) = \sum_{c_l} F_l(c_l, c_i) * F_{l+1}(c_j, c_l),    (2)

where G̃_l is a zero-padded version of G_l whose effective kernel size is K̃_l = K_1 + K_2 - 1 ≥ K_l. (Wei et al., 2016) showed a sufficient condition for exact network morphism:

    \max(C_l C_i K_1^2, C_j C_l K_2^2) \ge C_j C_i (K_1 + K_2 - 1)^2.    (3)

For simplicity, we shall denote equations (1) and (2) as B_j = G_l ⊛ B_i and G̃_l = F_{l+1} ⊛ F_l, where ⊛ is a non-commutative multi-channel convolution operator. We can also rewrite equation (3) as max(|F_l|, |F_{l+1}|) ≥ |G̃_l|, where |·| measures the size of the convolutional filter.
²Generally speaking, G_l is a 4D tensor of shape (C_j, C_i, K_l^H, K_l^W), where the convolutional kernel sizes for the blob height and width are not necessarily the same. However, in order to simplify the notations, we assume that K_l^H = K_l^W; the claims and theorems in this paper apply equally well when they are different.
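To make the ⊛ operator and equation (2) concrete, the sketch below composes two randomly initialized filters into a single effective filter and checks numerically that the composed layer computes the same function as the two stacked layers. It is a minimal illustration under the convention of plain (flip-and-slide) convolution, for which stacking is exactly associative; the helper names compose and conv_layer are ours, not from the paper.

```python
# A minimal numerical check of the TYPE-I morphing relation, assuming the
# mathematical (flip-and-slide) convolution; names are illustrative only.
import numpy as np
from scipy.signal import convolve2d

def compose(F_next, F_prev):
    """Multi-channel composition (the ⊛ operator): G[j, i] = sum_l F_next[j, l] * F_prev[l, i]."""
    Cj, Cl, K2, _ = F_next.shape
    _,  Ci, K1, _ = F_prev.shape
    G = np.zeros((Cj, Ci, K1 + K2 - 1, K1 + K2 - 1))
    for j in range(Cj):
        for i in range(Ci):
            for l in range(Cl):
                G[j, i] += convolve2d(F_next[j, l], F_prev[l, i], mode='full')
    return G

def conv_layer(F, B):
    """Multi-channel convolution of equation (1): B_out[j] = sum_i F[j, i] * B_in[i]."""
    Cj, Ci = F.shape[:2]
    return np.stack([sum(convolve2d(B[i], F[j, i], mode='full') for i in range(Ci))
                     for j in range(Cj)])

rng = np.random.default_rng(0)
F_l   = rng.standard_normal((8, 4, 3, 3))   # shape (C_l, C_i, K_1, K_1)
F_lp1 = rng.standard_normal((6, 8, 1, 1))   # shape (C_j, C_l, K_2, K_2)
B_i   = rng.standard_normal((4, 10, 10))    # input blob of shape (C_i, H, W)

G_tilde = compose(F_lp1, F_l)               # effective kernel of size K_1 + K_2 - 1
two_layers = conv_layer(F_lp1, conv_layer(F_l, B_i))
one_layer  = conv_layer(G_tilde, B_i)
print(np.allclose(two_layers, one_layer))   # True: the layer function is preserved
```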
3.2 ATOMIC NETWORK MORPHISM

We start with the simplest cases. Two atomic morphing types are considered, as shown in Fig. 1b: 1) a convolutional layer is morphed into two convolutional layers (TYPE-I); 2) a convolutional layer is morphed into two-way convolutional layers (TYPE-II). For the TYPE-I atomic morphing operation, equation (2) is satisfied, while for TYPE-II, the filter split is set to satisfy

    G_l = F_l^1 + F_l^2.    (4)

In addition, for TYPE-II, at the source end the blob is split into multiple copies, while at the sink end the blobs are joined by addition.
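A TYPE-II morph can be checked in the same spirit: by linearity of convolution, splitting a filter into two parallel filters that sum to the original, and joining the two paths by addition, leaves the output unchanged. A minimal single-channel sketch (ours, not the authors' code):

```python
# TYPE-II sketch: split G_l = F1 + F2 (equation (4)) and join the two paths by
# addition; linearity of convolution keeps the module function unchanged.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
G  = rng.standard_normal((3, 3))            # single-channel kernel for brevity
F1 = rng.standard_normal((3, 3))            # arbitrary first part of the split
F2 = G - F1                                 # second part, so that F1 + F2 = G
x  = rng.standard_normal((12, 12))

one_path  = convolve2d(x, G,  mode='full')
two_paths = convolve2d(x, F1, mode='full') + convolve2d(x, F2, mode='full')
print(np.allclose(one_path, two_paths))     # True
```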
3.3 GRAPH ABSTRACTION

To simplify the representation, we introduce the following graph abstraction for network morphism. For a convolutional neural network, we abstract it as a graph, with the blobs represented by vertices and the convolutional layers by edges. Formally, a DCNN is represented as a DAG M = (V, E), where V = {B_i}_{i=1}^N are blobs and E = {e_l = (B_i, B_j)}_{l=1}^L are convolutional layers. Each convolutional layer e_l connects two blobs B_i and B_j, and is associated with a convolutional filter F_l. Furthermore, in this graph, if outdegree(B_i) > 1, it implicitly means a split into multiple copies; and if indegree(B_i) > 1, it is a joint of addition.
Based on this abstraction, we formally introduce the following definition for modular network morphism:

Definition 1. Let M_0 = ({s, t}, e_0) represent the graph with only a single edge e_0 that connects the source vertex s and the sink vertex t. Let M = (V, E) represent any single-source, single-sink DAG with the same source vertex s and the same sink vertex t. We call such an M a module. If there exists a process by which we are able to morph M_0 to M, then we say that module M is morphable, and the morphing process is called modular network morphism.
Hence, based on this abstraction, modular network morphism can be represented as a graph transformation problem. As shown in Fig. 2b, module (C) in Fig. 2a can be transformed from module M_0 by applying the illustrated network morphism operations.
Each modular network morphing is associated with a modular network morphism equation:

Definition 2. Each module essentially corresponds to a function from s to t, which is called a module function. For a modular network morphism process from M_0 to M, the equation that guarantees that the module function is unchanged is called the modular network morphism equation.
It is obvious that equations (2) and (4) are the modular network morphism equations for the TYPE-I and TYPE-II atomic morphing types. In general, the modular network morphism equation for a module M can be written as the sum of all convolutional filter compositions, in which each composition is actually a path from s to t in the module M. Let {(F_{p,1}, F_{p,2}, ..., F_{p,i_p}) : p = 1, ..., P}, where i_p is the length of path p, be the set of all such paths represented by the convolutional filters. Then the modular network morphism equation for module M can be written as

    G_l = \sum_p F_{p,i_p} \circledast F_{p,i_p-1} \circledast \cdots \circledast F_{p,1}.    (5)
As an example illustrated in Fig. 2a, there are four paths in module (D), and its modular network morphism equation can be written as

    G_l = F_5 \circledast F_1 + F_6 \circledast (F_3 \circledast F_1 + F_4 \circledast F_2) + F_7 \circledast F_2,    (6)

where G_l is the convolutional filter associated with e_0 in module M_0.
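To make the path-sum form of equation (5) concrete, the sketch below encodes module (D) of Fig. 2a as an edge list (the internal vertex names a, b, c are our own labels, inferred from equation (6)), enumerates every source-to-sink path, and prints the composed term that each path contributes:

```python
# Enumerating the s→t filter paths of module (D); internal vertex names a, b, c
# are illustrative labels inferred from equation (6).
from collections import defaultdict

edges = [('s', 'a', 'F1'), ('s', 'c', 'F2'),
         ('a', 't', 'F5'), ('a', 'b', 'F3'),
         ('c', 'b', 'F4'), ('c', 't', 'F7'),
         ('b', 't', 'F6')]

out_edges = defaultdict(list)
for tail, head, name in edges:
    out_edges[tail].append((head, name))

def paths(v, prefix=()):
    """Yield every filter path from vertex v to the sink t."""
    if v == 't':
        yield prefix
        return
    for head, name in out_edges[v]:
        yield from paths(head, prefix + (name,))

for p in paths('s'):
    # Each path contributes one composed term of equation (5), read right-to-left.
    print(' ⊛ '.join(reversed(p)))
# Prints four terms: F5 ⊛ F1, F6 ⊛ F3 ⊛ F1, F6 ⊛ F4 ⊛ F2, F7 ⊛ F2
```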
3.4 THE COMPATIBILITY OF NETWORK MORPHISM EQUATION

One difficulty in this graph abstraction lies in the dimensional compatibility of convolutional filters or blobs. For example, for the TYPE-II atomic morphing in Fig. 1b, we have to satisfy G_l = F_l^1 + F_l^2. Suppose that G_l and F_l^2 are of shape (64, 64, 3, 3), while F_l^1 is (64, 64, 1, 1); they are actually not addable. Formally, we define the compatibility of the modular network morphism equation as follows:
Figure 2: Example modules and morphing processes. (a) Modules (A)-(C) are simple morphable, while (D) is not; (b) a morphing process for module (C); for module (D), we are not able to find such a process.
Definition 3. The modular network morphism equation for a module M is compatible if and only if the mathematical operators between the convolutional filters involved in this equation are well-defined.
In order to solve this compatibility problem, we no longer assume that the blobs {B_i} and filters {F_l} are finite-dimension tensors. Instead, they are considered as infinite-dimension tensors defined with a finite support³, and we call this an extended definition. An instant advantage of adopting this extended definition is that we no longer need to differentiate G_l and G̃_l in equation (2), since G̃_l is simply a zero-padded version of G_l.
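In practice, working with the finite-support view simply means zero-padding the smaller filter to the larger spatial support before adding. A minimal sketch for the (64, 64, 1, 1) versus (64, 64, 3, 3) example above (our illustration, not the authors' code):

```python
# Zero-pad a 1x1 filter to the 3x3 support of G_l so that G_l = F1 + F2 is
# well-defined; with infinite-support tensors this padding is implicit.
import numpy as np

F1_small = np.random.randn(64, 64, 1, 1)                   # (C_j, C_i, 1, 1)
F1 = np.pad(F1_small, ((0, 0), (0, 0), (1, 1), (1, 1)))    # centered in a 3x3 support

G  = np.random.randn(64, 64, 3, 3)
F2 = G - F1                                                # now addable: F1 + F2 == G
print(F1.shape, np.allclose(F1 + F2, G))                   # (64, 64, 3, 3) True
```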
Lemma 4. The operations + and ⊛ are well-defined for the modular network morphism equation. Namely, if F^1 and F^2 are infinite-dimension 4D tensors with finite support, let G = F^1 + F^2 and H = F^2 ⊛ F^1; then both G and H are uniquely defined and also have finite support.

Sketch of Proof. It is quite obvious that this lemma holds for the operator +. For the operator ⊛, under this extended definition the sum in equation (2) becomes infinite over the index c_l. It is straightforward to show that this infinite sum converges, and also that H is finitely supported with respect to the indices c_j and c_i. Hence H has finite support.
As a corollary, we have:

Corollary 5. The modular network morphism equation for any module M is always compatible if the filters involved in M are considered as infinite-dimension tensors with finite support.
3.5 SIMPLE MORPHABLE MODULES

In this section, we introduce a large family of modules, i.e., simple morphable modules, and then provide their morphing solutions. We first introduce the following definition:

Definition 6. A module M is simple morphable if and only if it is able to be morphed with only combinations of atomic morphing operations.

Several example modules are shown in Fig. 2a. It is obvious that modules (A)-(C) are simple morphable, and the morphing process for module (C) is also illustrated in Fig. 2b.
For a simple morphable module M, we are able to identify a morphing sequence from M_0 to M. The algorithm is illustrated in Algorithm 1. The core idea is to use the reverse operations of the atomic morphing types to reduce M to M_0; the morphing process is then just the reverse of the reduction process. In Algorithm 1, we use a four-element tuple (M, e_1, {e_2, e_3}, type) to represent the process of morphing edge e_1 in module M to {e_2, e_3} using the TYPE-<type> atomic operation. Two auxiliary functions, CHECKTYPEI and CHECKTYPEII, are further introduced.

³A support of a function is defined as the set of points where the function value is non-zero, i.e., support(f) = {x | f(x) ≠ 0}.
Algorithm 1 Algorithm for Simple Morphable Modules
Input: M_0; a simple morphable module M
Output: The morphing sequence Q that morphs M_0 to M using atomic morphing operations
  Q = ∅
  while M ≠ M_0 do
    while CHECKTYPEI(M) is not FALSE do
      // Let (M_temp, e_1, {e_2, e_3}, type) be the return value of CHECKTYPEI(M)
      Q.prepend((M_temp, e_1, {e_2, e_3}, type)) and M ← M_temp
    end while
    while CHECKTYPEII(M) is not FALSE do
      // Let (M_temp, e_1, {e_2, e_3}, type) be the return value of CHECKTYPEII(M)
      Q.prepend((M_temp, e_1, {e_2, e_3}, type)) and M ← M_temp
    end while
  end while
Algorithm 2 Algorithm for Irreducible Modules
Input: G_l; an irreducible module M
Output: Convolutional filters {F_i}_{i=1}^n of M
  Initialize {F_i}_{i=1}^n with random noise.
  Calculate the effective kernel size of M, and expand G_l to G̃_l by padding zeros.
  repeat
    for j = 1 to n do
      Fix {F_i : i ≠ j}, and calculate F_j = deconv(G̃_l, {F_i : i ≠ j})
      Calculate the loss l = ||G̃_l − conv({F_i}_{i=1}^n)||²
    end for
  until l = 0 or maxIter is reached
Both of these functions return either FALSE, if there is no such atomic sub-module in M, or a morphing tuple (M, e_1, {e_2, e_3}, type) if there is. The algorithm of CHECKTYPEI only needs to find a vertex satisfying indegree(B_i) = outdegree(B_i) = 1, while CHECKTYPEII looks for entries greater than 1 in the adjacency matrix representation of module M.
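The sketch below is our simplified Python reading of Algorithm 1, with a module given as a list of directed edges (tail, head, name): check_type_i contracts a pass-through vertex (the reverse of a TYPE-I morph), check_type_ii merges a pair of parallel edges (the reverse of a TYPE-II morph), and the recorded steps, prepended, form the morphing sequence from M_0 to M.

```python
# A hedged sketch of Algorithm 1's reduction loop on an edge-list multigraph.
from collections import Counter

def check_type_i(edges, source, sink):
    """Reverse TYPE-I: contract a vertex with indegree = outdegree = 1."""
    indeg  = Counter(h for _, h, _ in edges)
    outdeg = Counter(t for t, _, _ in edges)
    for v in set(indeg) | set(outdeg):
        if v not in (source, sink) and indeg[v] == 1 and outdeg[v] == 1:
            e_in  = next(e for e in edges if e[1] == v)
            e_out = next(e for e in edges if e[0] == v)
            merged = (e_in[0], e_out[1], f"({e_out[2]} ⊛ {e_in[2]})")
            rest = [e for e in edges if e is not e_in and e is not e_out]
            return rest + [merged], (merged, (e_in, e_out), 'I')
    return None

def check_type_ii(edges, source, sink):
    """Reverse TYPE-II: merge two parallel edges into one."""
    for a in edges:
        for b in edges:
            if a is not b and a[:2] == b[:2]:
                merged = (a[0], a[1], f"({a[2]} + {b[2]})")
                rest = [e for e in edges if e is not a and e is not b]
                return rest + [merged], (merged, (a, b), 'II')
    return None

def reduce_module(edges, source='s', sink='t'):
    """Reduce M towards M0; the prepended steps are the morphing sequence."""
    steps = []
    while len(edges) > 1:
        found = check_type_i(edges, source, sink) or check_type_ii(edges, source, sink)
        if found is None:
            return edges, steps            # irreducible: M is not simple morphable
        edges, step = found
        steps.insert(0, step)
    return edges, steps

# A small simple morphable example: a two-way module whose second path has two layers.
module = [('s', 't', 'F1'), ('s', 'a', 'F2'), ('a', 't', 'F3')]
remaining, seq = reduce_module(module)
print(remaining)   # a single s→t edge, i.e. M0
print(len(seq))    # 2 morphing steps recovered
```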
Is there a module that is not simple morphable? The answer is yes, and an example is module (D) in Fig. 2a. A simple try does not work, as shown in Fig. 2b. In fact, we have the following proposition:

Proposition 7. Module (D) in Fig. 2a is not simple morphable.
Sketch of Proof. A simple morphable module M is always able to be reverted back to M_0. However, for module (D) in Fig. 2a, both CHECKTYPEI and CHECKTYPEII return FALSE.
3.6 MODULAR NETWORK MORPHISM THEOREM

For a module that is not simple morphable, which is called a complex module, we are able to first apply Algorithm 1 to reduce it to an irreducible module M. For M, we propose Algorithm 2 to solve the modular network morphism equation. The core idea of this algorithm is that, if only one convolutional filter is allowed to change with all others fixed, the modular network morphism equation will reduce to a linear system. The following argument guarantees the correctness of Algorithm 2.
Correctness of Algorithm 2. Let G_l and {F_i}_{i=1}^n be the convolutional filter(s) associated with M_0 and M. We further assume that one of {F_i}, e.g., F_j, is larger than or equal to G̃_l, where G̃_l is the zero-padded version of G_l (this assumption is a weak condition in the expanding mode). The modular network morphism equation for M can be written as

    \tilde{G}_l = C_1 \circledast F_j \circledast C_2 + C_3,    (7)
Figure 3: Detailed architectures of (a) the ResNet module and (b) the morph_1c1 module.
Figure 4: Sample modules adopted in the proposed experiments: (a) ResNet, (b) morph_1c1, (c) morph_3c1, (d) morph_3c3, (e) morph_1c1_2branch. (a) and (b) are the graph abstractions of the modules illustrated in Fig. 3(a) and (b).
where C_1, C_2, and C_3 are composed of the other filters {F_i : i ≠ j}. It can be checked that equation (7) is a linear system with |G̃_l| constraints and |F_j| free variables. Since we have |F_j| ≥ |G̃_l|, the system is underdetermined and hence solvable, as random matrices are rarely inconsistent.
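This linear-system view can be exercised directly: with all filters but one fixed, the map from the free filter to the module's effective kernel is linear, so the free filter can be recovered by a least-squares solve. The sketch below (ours; it probes the linear map with basis tensors instead of calling a dedicated deconv routine) performs one such solve for the simplest composition G̃_l = F_2 ⊛ F_1, i.e., one sweep of the inner loop of Algorithm 2:

```python
# One step of Algorithm 2's alternating scheme: fix F1, then solve the linear
# system vec(G_tilde) = A · vec(F2) for F2 by least squares.  Sizes are chosen
# so that |F2| >= |G_tilde|, matching the solvability condition above.
import numpy as np
from scipy.signal import convolve2d

def compose(F2, F1):
    """The ⊛ operator: G[j, i] = sum_l F2[j, l] * F1[l, i] (full 2D convolution)."""
    Cj, Cl, K2, _ = F2.shape
    _,  Ci, K1, _ = F1.shape
    G = np.zeros((Cj, Ci, K1 + K2 - 1, K1 + K2 - 1))
    for j in range(Cj):
        for i in range(Ci):
            for l in range(Cl):
                G[j, i] += convolve2d(F2[j, l], F1[l, i], mode='full')
    return G

def solve_free_filter(G_target, F1, F2_shape):
    """Build A column-by-column by probing with basis tensors, then solve."""
    n = int(np.prod(F2_shape))
    A = np.zeros((G_target.size, n))
    for k in range(n):
        basis = np.zeros(n)
        basis[k] = 1.0
        A[:, k] = compose(basis.reshape(F2_shape), F1).ravel()
    sol, *_ = np.linalg.lstsq(A, G_target.ravel(), rcond=None)
    return sol.reshape(F2_shape)

rng = np.random.default_rng(0)
G  = rng.standard_normal((4, 2, 3, 3))                   # parent filter G_l
F1 = rng.standard_normal((8, 2, 3, 3))                   # fixed filter (random init)
G_pad = np.pad(G, ((0, 0), (0, 0), (1, 1), (1, 1)))      # zero-pad to effective size 5
F2 = solve_free_filter(G_pad, F1, (4, 8, 3, 3))
print(np.abs(compose(F2, F1) - G_pad).max())             # ≈ 0: the morphism equation holds
```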
For a general module M, whether simple morphable or not, we apply Algorithm 1 to reduce M to an irreducible module M′, and then apply Algorithm 2 to M′. Hence we have the following theorem:

Theorem 8. A convolutional layer can be morphed into any module (any single-source, single-sink DAG subnet).

This theorem answers the core question of network morphism, and provides a theoretical upper bound for the capability of this learning scheme.
3.7 NON-LINEARITY AND BATCH NORMALIZATION IN MODULAR NETWORK MORPHISM

Besides the convolutional layers, a neural network module typically also involves non-linearity layers and batch normalization layers, as illustrated in Fig. 3. In this section, we describe how we handle these layers for modular network morphism.

For the non-linear activation layers, we adopt the solution proposed in (Wei et al., 2016). Instead of directly applying the non-linear activations, we use their parametric forms. Let ϕ be any non-linear activation function; its parametric form is defined to be

    P-ϕ = {ϕ^a}|_{a∈[0,1]} = {(1 − a)·ϕ + a·ϕ_id}|_{a∈[0,1]}.    (8)

The shape of the parametric form of the non-linear activation ϕ is controlled by the parameter a. When a is initialized (a = 1), the parametric form is equivalent to an identity function, and when the value of a has been learned, the parametric form becomes a non-linear activation. In Fig. 3b, the non-linear activation for the morphing process is annotated as PReLU to differentiate it from the other ReLU activations.
Table 1: Experimental results of networks morphed from ResNet-20, ResNet-56, and ResNet-110 on the CIFAR10 dataset. Results annotated with † are from (He et al., 2015).

Net Arch. | Intermediate Phases | Error | Abs. Perf. Improv. | Rel. Perf. Improv. | #Params. (MB) | #Params. Rel. | FLOP (million) | Rel. FLOP
resnet20† | - | 8.75% | - | - | 1.048 | 1x | 40.8 | 1x
morph20_1c1 | - | 7.35% | 1.40% | 16.0% | 1.138 | 1.09x | 44.0 | 1.08x
morph20_3c1 | - | 7.10% | 1.65% | 18.9% | 1.466 | 1.40x | 56.5 | 1.38x
morph20_3c1 | 1c1 | 6.83% | 1.92% | 21.9% | 1.466 | 1.40x | 56.5 | 1.38x
morph20_3c3 | - | 6.97% | 1.78% | 20.3% | 1.794 | 1.71x | 69.1 | 1.69x
morph20_3c3 | 1c1, 3c1 | 6.66% | 2.09% | 23.9% | 1.794 | 1.71x | 69.1 | 1.69x
morph20_1c1_2branch | - | 7.26% | 1.49% | 17.0% | 1.227 | 1.17x | 47.1 | 1.15x
morph20_1c1_2branch | 1c1, half | 6.60% | 2.15% | 24.6% | 1.227 | 1.17x | 47.1 | 1.15x
resnet56† | - | 6.97% | - | - | 3.289 | 1x | 125.7 | 1x
morph56_1c1_half | - | 5.68% | 1.29% | 18.5% | 3.468 | 1.05x | 132.0 | 1.05x
morph56_1c1 | - | 5.94% | 1.03% | 14.8% | 3.647 | 1.11x | 138.3 | 1.10x
morph56_1c1 | 1c1_half | 5.37% | 1.60% | 23.0% | 3.647 | 1.11x | 138.3 | 1.10x
resnet110† | - | 6.61% ± 0.16 | - | - | 6.649 | 1x | 253.1 | 1x
morph110_1c1_half | - | 5.74% | 0.87% | 13.2% | 7.053 | 1.06x | 267.3 | 1.06x
morph110_1c1 | - | 5.93% | 0.68% | 10.3% | 7.412 | 1.11x | 279.9 | 1.11x
morph110_1c1 | 1c1_half | 5.50% | 1.11% | 16.8% | 7.412 | 1.11x | 279.9 | 1.11x
Table 2: Comparison results between learning from morphing and learning from scratch for the same network architectures on the CIFAR10 dataset.

Net Arch. | Error (scratch) | Error (morph) | Abs. Perf. Improv. | Rel. Perf. Improv.
morph20_1c1 | 8.01% | 7.35% | 0.66% | 8.2%
morph20_1c1_2branch | 7.90% | 6.60% | 1.30% | 16.5%
morph56_1c1 | 7.37% | 5.37% | 2.00% | 27.1%
morph110_1c1 | 8.16% | 5.50% | 2.66% | 32.6%
In the proposed experiments, for simplicity, all ReLUs are replaced with PReLUs.
The batch normalization layers (Ioffe & Szegedy, 2015) can be represented as

    new_data = (data − mean) / sqrt(var + eps) · gamma + beta.    (9)

It is obvious that if we set gamma = sqrt(var + eps) and beta = mean, then a batch normalization layer is reduced to an identity mapping layer, and hence it can be inserted anywhere in the network. Although it is possible to calculate the values of gamma and beta from the training data, in this research we adopt another, simpler approach by setting gamma = 1 and beta = 0. In fact, the value of gamma can be set to any nonzero number, since the scale is then normalized by the latter batch normalization layer (the lower-right one in Fig. 3b). Mathematically and strictly speaking, when we set beta = 0, the network function is actually changed. However, since the morphed filters of the convolutional layers are roughly randomized, the mean of the data, even though not strictly zero, is still approximately zero. Together with the fact that the data is then normalized by the latter batch normalization layer, such a small perturbation of the network function can be neglected. In the proposed experiments, only statistical variances in performance are observed for the morphed network when we set beta to zero. The reason we prefer this approach to using the training data is that it is easier to implement and also yields slightly better results when we continue to train the morphed network.
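The sketch below illustrates these identity initializations in PyTorch (an assumed framework choice, not the authors' code): a PReLU with slope a = 1 is exactly the parametric form of ReLU at its identity initialization, and a freshly initialized batch normalization layer with gamma = 1, beta = 0, and default running statistics acts as an identity mapping at insertion time.

```python
# Identity-initialized layers that can be inserted without changing the network
# function (up to the small beta = 0 perturbation discussed above).
import torch
import torch.nn as nn

x = torch.randn(2, 16, 8, 8)

prelu = nn.PReLU(num_parameters=1, init=1.0)    # parametric activation with a = 1
print(torch.allclose(prelu(x), x))               # True: identity at initialization

bn = nn.BatchNorm2d(16)                          # gamma = 1, beta = 0 by default
bn.eval()                                        # use the (identity) running statistics
print(torch.allclose(bn(x), x, atol=1e-4))       # True up to the eps in the denominator
```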
4 EXPERIMENTAL RESULTS

In this section, we report the results of the proposed morphing algorithms based on the current state-of-the-art ResNet (He et al., 2015), the winner of the 2015 ImageNet classification task.
Table 3: Experimental results of networks morphed from ResNet-20, ResNet-56, and ResNet-110 on the CIFAR100 dataset.

Net Arch. | Intermediate Phases | Error | Abs. Perf. Improv. | Rel. Perf. Improv. | #Params. (MB) | #Params. Rel. | FLOP (million) | Rel. FLOP
resnet20 | - | 32.82% | - | - | 1.070 | 1x | 40.8 | 1x
morph20_1c1 | - | 31.70% | 1.12% | 3.4% | 1.160 | 1.08x | 44.0 | 1.08x
resnet56 | - | 29.83% | - | - | 3.311 | 1x | 125.8 | 1x
morph56_1c1 | 1c1_half | 27.52% | 2.31% | 7.7% | 3.670 | 1.11x | 138.3 | 1.10x
resnet110 | - | 28.46% | - | - | 6.672 | 1x | 253.2 | 1x
morph110_1c1 | 1c1_half | 26.81% | 1.65% | 5.8% | 7.434 | 1.11x | 279.9 | 1.11x
Table 4: Comparison results between learning from morphing and learning from scratch for the same network architectures on the CIFAR100 dataset.

Net Arch. | Error (scratch) | Error (morph) | Abs. Perf. Improv. | Rel. Perf. Improv.
morph20_1c1 | 33.63% | 31.70% | 1.93% | 5.7%
morph56_1c1 | 32.58% | 27.52% | 5.06% | 15.5%
morph110_1c1 | 31.94% | 26.81% | 5.13% | 16.1%
4.1 NETWORK ARCHITECTURES OF MODULAR NETWORK MORPHISM

We first introduce the network architectures used in the proposed experiments. Fig. 3a shows the module template in the design of ResNet (He et al., 2015), which is actually a simple morphable two-way module. The first path consists of two convolutional layers, and the second path is a shortcut connection of identity mapping. The architecture of the ResNet module can be abstracted as the graph in Fig. 4a. For the morphed networks, we first split the identity mapping layer in the ResNet module into two layers, each with a scaling factor of 0.5. Each of the scaled identity mapping layers can then be further morphed into two convolutional layers. Fig. 3b illustrates the case with only one scaled identity mapping layer morphed into two convolutional layers, and its equivalent graph abstraction is shown in Fig. 4b. To differentiate the network architectures adopted in this research, the notation morph_<k1>c<k2> is introduced, where k1 and k2 are the kernel sizes in the morphed network. If both scaled identity mapping branches are morphed, we append the suffix '_2branch'. Some examples of morphed modules are illustrated in Fig. 4. We also use the suffix '_half' to indicate that only one half (the odd-indexed ones) of the modules are morphed, and the other half are left as original ResNet modules.
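For illustration, the following PyTorch sketch assembles a morph_1c1 block as we read it from Fig. 3b; it is a hedged reconstruction (the class and attribute names are ours), in which the shortcut is split into a 0.5-scaled identity plus a 1x1-PReLU-1x1 branch that the morphing equations above can initialize to reproduce the other half of the identity, so the block function is preserved at the moment of morphing.

```python
# A hedged reconstruction of the morph_1c1 block of Fig. 3b (not the authors' code):
# original 3x3-3x3 residual path + 0.5-scaled identity shortcut + a morphed
# 1x1-PReLU-1x1 branch standing in for the other half of the shortcut.
import torch
import torch.nn as nn

class Morph1c1Block(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.main = nn.Sequential(            # the original ResNet residual path
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.morphed = nn.Sequential(         # the morphed half of the shortcut
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.PReLU(num_parameters=1, init=1.0),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # 0.5 * x is the remaining scaled identity; the morphed branch is initialized
        # to contribute the other 0.5 * x, keeping the block function unchanged.
        return self.relu(self.main(x) + 0.5 * x + self.morphed(x))

block = Morph1c1Block(16)
print(block(torch.randn(2, 16, 32, 32)).shape)   # torch.Size([2, 16, 32, 32])
```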
Figure 5: Comparison results of ResNet and the morphed networks on the (a) CIFAR10 and (b) CIFAR100 datasets.
4.2 EXPERIMENTAL RESULTS ON THE CIFAR10 DATASET

CIFAR10 (Krizhevsky & Hinton, 2009) is a benchmark dataset for image classification and neural network investigation. It consists of 32×32 color images in 10 categories, with 50,000 training images and 10,000 testing images. In the training process, we follow the same setup as in (He et al., 2015). We use a weight decay of 0.0001 and a momentum of 0.9. We adopt simple data augmentation with a pad of 4 pixels on each side of the original image. A 32×32 view is randomly cropped from the padded image, and a random horizontal flip is optionally applied.
Table 1 shows the results of different networks morphed from ResNet (He et al., 2015). Notice that it is very challenging to further improve the performance, since ResNet has already pushed the numbers to a very high level. For example, ResNet (He et al., 2015) obtained only a 0.36% performance improvement by extending the model from 56 to 110 layers (Table 1). From Table 1 we can see that, with only 1.2x or less computational cost, the morphed networks achieve 2.15%, 1.60%, and 1.11% performance improvements over the original ResNet-20, ResNet-56, and ResNet-110, respectively. Notice that the relative performance improvement can be up to 25%. Table 1 also compares the number of parameters of the original network architectures and those after morphing. As can be seen, the morphed networks have only slightly more parameters than the original ones, typically less than 1.2x.
Besides the large error rate reduction achieved by the morphed networks, one exciting indication from Table 1 is that the morphed 20-layered network morph20_3c3 is able to achieve a slightly lower error rate than the 110-layered ResNet (6.60% vs 6.61%), at less than 1/5 of the computational cost of the latter. Similar results have also been observed for the morphed 56-layered network: it is able to achieve a 5.37% error rate, which is even lower than those of ResNet-110 (6.61%) and ResNet-164 (5.46%) (He et al., 2016). These results are also illustrated in Fig. 5(a).
Several different architectures of the morphed networks were also explored, as illustrated in Fig. 4 and Table 1. First, when the kernel sizes were expanded from 1×1 to 3×3, the morphed networks (morph20_3c1 and morph20_3c3) achieved better performance. Similar results were reported in (Simonyan & Zisserman, 2014) (Table 1 for models C and D). However, because these morphed networks almost double the computational cost, we did not adopt this approach. Second, we also tried to morph the other scaled identity mapping layer into two convolutional layers (morph20_1c1_2branch), and the error rate was further lowered for the 20-layered network. However, for the 56-layered and 110-layered networks, this strategy did not yield better results.
We also found that a morphed network learned with multiple phases could achieve a lower error rate than one learned with a single phase. For example, the networks morph20_3c1 and morph20_3c3 learned with intermediate phases achieved better results in Table 1. This is quite reasonable, as it divides the optimization problem into sequential phases, and thus makes it possible to avoid being trapped in a local minimum to some extent. Inspired by this observation, we then used a 1c1_half network as an intermediate phase for the morph56_1c1 and morph110_1c1 networks, and better results were achieved.
Finally, we compared the proposed learning scheme against learning from scratch for networks with the same architectures. These results are illustrated in Table 2. As can be seen, networks learned by morphing are able to achieve up to a 2.66% absolute and a 32.6% relative performance improvement over learning from scratch for the morph110_1c1 network architecture. These results are quite reasonable: when networks are learned by the proposed morphing scheme, they have already been regularized, and thus have a lower probability of being trapped in a poorly-performing local minimum during the continued training than with the learning-from-scratch scheme. One may also notice that morph110_1c1 actually performed worse than resnet110 when learned from scratch. This is because the network architecture morph_1c1 is proposed for morphing, and its identity shortcut connection is scaled with a factor of 0.5. It was also reported in (He et al., 2016) (Table 1) that residual networks with a constant scaling factor of 0.5 actually led to worse performance, while this performance degradation problem can be avoided by the proposed morphing scheme.
4.3 EXPERIMENTAL RESULTS ON THE CIFAR100 DATASET

CIFAR100 (Krizhevsky & Hinton, 2009) is another benchmark dataset of tiny images that consists of 100 categories. There are 500 training images and 100 testing images per category. The proposed