Published as a conference paper at ICLR 2018

SPECTRAL NORMALIZATION FOR GENERATIVE ADVERSARIAL NETWORKS

Takeru Miyato¹, Toshiki Kataoka¹, Masanori Koyama², Yuichi Yoshida³
{miyato, kataoka}@preferred.jp, [email protected], [email protected]
¹Preferred Networks, Inc. ²Ritsumeikan University ³National Institute of Informatics

ABSTRACT

One of the challenges in the study of generative adversarial networks is the instability of their training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on the CIFAR-10, STL-10, and ILSVRC2012 datasets, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) are capable of generating images of better or equal quality relative to previous training stabilization techniques. The code with Chainer (Tokui et al., 2015), generated images and pretrained models are available at https://github.com/pfnet-research/sngan_projection.

1 INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been enjoying considerable success as a framework of generative models in recent years, and they have been applied to numerous types of tasks and datasets (Radford et al., 2016; Salimans et al., 2016; Ho & Ermon, 2016; Li et al., 2017). In a nutshell, GANs are a framework to produce a model distribution that mimics a given target distribution, and they consist of a generator that produces the model distribution and a discriminator that distinguishes the model distribution from the target. The concept is to consecutively train the model distribution and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution, measured by the best discriminator possible at each step of the training. GANs have been drawing attention in the machine learning community not only for their ability to learn highly structured probability distributions but also for their theoretically interesting aspects. For example, Nowozin et al. (2016), Uehara et al. (2016), and Mohamed & Lakshminarayanan (2017) revealed that the training of the discriminator amounts to the training of a good estimator for the density ratio between the model distribution and the target. This perspective opens the door to the methods of implicit models (Mohamed & Lakshminarayanan, 2017; Tran et al., 2017) that can be used to carry out variational optimization without direct knowledge of the density function.

A persisting challenge in the training of GANs is the performance control of the discriminator. In high dimensional spaces, the density ratio estimation by the discriminator is often inaccurate and unstable during training, and generator networks fail to learn the multimodal structure of the target distribution. Even worse, when the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such a discriminator is produced in this situation, the training of the generator comes to a complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of the discriminator.

In this paper, we propose a novel weight normalization method called spectral normalization that can stabilize the training of discriminator networks. Our normalization enjoys the following favorable properties:
• The Lipschitz constant is the only hyper-parameter to be tuned, and the algorithm does not require intensive tuning of this hyper-parameter for satisfactory performance.
• Implementation is simple and the additional computational cost is small.

In fact, our normalization method also functioned well even without tuning the Lipschitz constant, which is the only hyper-parameter. In this study, we provide explanations of the effectiveness of spectral normalization for GANs against other regularization techniques, such as weight normalization (Salimans & Kingma, 2016), weight clipping (Arjovsky et al., 2017), and gradient penalty (Gulrajani et al., 2017). We also show that, in the absence of complementary regularization techniques (e.g., batch normalization, weight decay and feature matching on the discriminator), spectral normalization can improve the sheer quality of the generated images better than weight normalization and gradient penalty.

2 METHOD

In this section, we will lay the theoretical groundwork for our proposed method. Let us consider a simple discriminator made of a neural network of the following form, with input $x$:

$f(x, \theta) = W^{L+1} a_L (W^L (a_{L-1} (W^{L-1} ( \cdots a_1 (W^1 x) \cdots )))),$   (1)

where $\theta := \{W^1, \dots, W^L, W^{L+1}\}$ is the set of learning parameters, $W^l \in \mathbb{R}^{d_l \times d_{l-1}}$, $W^{L+1} \in \mathbb{R}^{1 \times d_L}$, and $a_l$ is an element-wise non-linear activation function. We omit the bias term of each layer for simplicity. The final output of the discriminator is given by

$D(x, \theta) = A(f(x, \theta)),$   (2)

where $A$ is an activation function corresponding to the divergence or distance measure of the user's choice. The standard formulation of GANs is given by

$\min_G \max_D V(G, D),$

where the min and max over $G$ and $D$ are taken over the set of generator and discriminator functions, respectively. The conventional form of $V(G,D)$ (Goodfellow et al., 2014) is $E_{x \sim q_{data}}[\log D(x)] + E_{x' \sim p_G}[\log(1 - D(x'))]$, where $q_{data}$ is the data distribution and $p_G$ is the (model) generator distribution to be learned through the adversarial min-max optimization. The activation function $A$ used in the $D$ of this expression is some continuous function with range $[0,1]$ (e.g., the sigmoid function). It is known that, for a fixed generator $G$, the optimal discriminator for this form of $V(G,D)$ is given by $D^*_G(x) := q_{data}(x)/(q_{data}(x) + p_G(x))$.

The machine learning community has recently been pointing out that the function space from which the discriminators are selected crucially affects the performance of GANs. A number of works (Uehara et al., 2016; Qi, 2017; Gulrajani et al., 2017) advocate the importance of Lipschitz continuity in assuring the boundedness of statistics. For example, the optimal discriminator of GANs in the above standard formulation takes the form

$D^*_G(x) = \frac{q_{data}(x)}{q_{data}(x) + p_G(x)} = \mathrm{sigmoid}(f^*(x)), \text{ where } f^*(x) = \log q_{data}(x) - \log p_G(x),$   (3)

and its derivative

$\nabla_x f^*(x) = \frac{1}{q_{data}(x)} \nabla_x q_{data}(x) - \frac{1}{p_G(x)} \nabla_x p_G(x)$   (4)

can be unbounded or even incomputable. This prompts us to introduce some regularity condition on the derivative of $f(x)$.

A particularly successful line of work in this direction is (Qi, 2017; Arjovsky et al., 2017; Gulrajani et al., 2017), which proposed methods to control the Lipschitz constant of the discriminator by adding regularization terms defined on input examples $x$. We follow their footsteps and search for the discriminator $D$ within the set of $K$-Lipschitz continuous functions, that is,

$\underset{\|f\|_{Lip} \leq K}{\mathrm{argmax}}\ V(G, D),$   (5)

where $\|f\|_{Lip}$ denotes the smallest value $M$ such that $\|f(x) - f(x')\| / \|x - x'\| \leq M$ for any $x, x'$, with the norm being the $\ell_2$ norm.
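To make the layered form of (1) and (2) concrete, the discriminator can be written in a few lines of NumPy. This is a minimal illustrative sketch, not the Chainer implementation released with the paper; the helper names f and D, the choice of ReLU as the activation and the sigmoid as A are assumptions made only for the example.

import numpy as np

def relu(h):
    # ReLU is 1-Lipschitz, which matters for the bound derived in Section 2.1
    return np.maximum(h, 0.0)

def f(x, weights):
    # Eq. (1): h_0 = x, h_l = a_l(W^l h_{l-1}), output = W^{L+1} h_L.
    # weights = [W^1, ..., W^L, W^{L+1}]; biases are omitted as in the paper.
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

def D(x, weights):
    # Eq. (2) with A chosen to be the sigmoid, mapping the logit into [0, 1]
    return 1.0 / (1.0 + np.exp(-f(x, weights)))

# tiny usage example with random weights (d_0 = 16, d_1 = d_2 = 32)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((32, 16)),
           rng.standard_normal((32, 32)),
           rng.standard_normal((1, 32))]
print(D(rng.standard_normal(16), weights))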
While input-based regularizations allow for relatively easy formulations based on samples, they also suffer from the fact that they cannot impose regularization on the space outside the supports of the generator and data distributions without introducing somewhat heuristic means. The method we introduce in this paper, called spectral normalization, aims to skirt this issue by normalizing the weight matrices using the technique devised by Yoshida & Miyato (2017).

2.1 SPECTRAL NORMALIZATION

Our spectral normalization controls the Lipschitz constant of the discriminator function $f$ by literally constraining the spectral norm of each layer $g: h_{in} \mapsto h_{out}$. By definition, the Lipschitz norm $\|g\|_{Lip}$ is equal to $\sup_h \sigma(\nabla g(h))$, where $\sigma(A)$ is the spectral norm of the matrix $A$ (the $L_2$ matrix norm of $A$):

$\sigma(A) := \max_{h: h \neq 0} \frac{\|Ah\|_2}{\|h\|_2} = \max_{\|h\|_2 \leq 1} \|Ah\|_2,$   (6)

which is equivalent to the largest singular value of $A$. Therefore, for a linear layer $g(h) = Wh$, the norm is given by $\|g\|_{Lip} = \sup_h \sigma(\nabla g(h)) = \sup_h \sigma(W) = \sigma(W)$. If the Lipschitz norm of the activation function $\|a_l\|_{Lip}$ is equal to 1¹, we can use the inequality $\|g_1 \circ g_2\|_{Lip} \leq \|g_1\|_{Lip} \cdot \|g_2\|_{Lip}$ to observe the following bound on $\|f\|_{Lip}$:

$\|f\|_{Lip} \leq \|(h_L \mapsto W^{L+1} h_L)\|_{Lip} \cdot \|a_L\|_{Lip} \cdot \|(h_{L-1} \mapsto W^L h_{L-1})\|_{Lip} \cdots \|a_1\|_{Lip} \cdot \|(h_0 \mapsto W^1 h_0)\|_{Lip} = \prod_{l=1}^{L+1} \|(h_{l-1} \mapsto W^l h_{l-1})\|_{Lip} = \prod_{l=1}^{L+1} \sigma(W^l).$   (7)

Our spectral normalization normalizes the spectral norm of the weight matrix $W$ so that it satisfies the Lipschitz constraint $\sigma(W) = 1$:

$\bar{W}_{SN}(W) := W / \sigma(W).$   (8)

If we normalize each $W^l$ using (8), we can appeal to inequality (7) and the fact that $\sigma(\bar{W}_{SN}(W)) = 1$ to see that $\|f\|_{Lip}$ is bounded from above by 1.

Here, we would like to emphasize the difference between our spectral normalization and the spectral norm "regularization" introduced by Yoshida & Miyato (2017). Unlike our method, spectral norm "regularization" penalizes the spectral norm by adding an explicit regularization term to the objective function. Their method is fundamentally different from ours in that they do not make an attempt to 'set' the spectral norm to a designated value. Moreover, when we reorganize the derivative of our normalized cost function and rewrite our objective function (12), we see that our method augments the cost function with a sample-data-dependent regularization function. Spectral norm regularization, on the other hand, imposes sample-data-independent regularization on the cost function, just like L2 regularization and Lasso.

2.2 FAST APPROXIMATION OF THE SPECTRAL NORM $\sigma(W)$

As we mentioned above, the spectral norm $\sigma(W)$ that we use to regularize each layer of the discriminator is the largest singular value of $W$. If we naively apply singular value decomposition to compute $\sigma(W)$ at each round of the algorithm, the algorithm can become computationally heavy. Instead, we can use the power iteration method to estimate $\sigma(W)$ (Golub & Van der Vorst, 2000; Yoshida & Miyato, 2017). With the power iteration method, we can estimate the spectral norm with very small additional computational time relative to the full computational cost of vanilla GANs. Please see Appendix A for the details of the method and Algorithm 1 for a summary of the actual spectral normalization algorithm.
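As a rough illustration of this procedure, the following NumPy sketch runs one (or a few) power-iteration step(s) and then applies the normalization of (8). It is a minimal sketch, not the released Chainer code: the function name spectral_normalize is hypothetical, a dense weight matrix is assumed, and the vector u stands for the persistent estimate of the first left singular vector that is carried over between updates so that a single iteration per update suffices in practice.

import numpy as np

def spectral_normalize(W, u, n_iters=1, eps=1e-12):
    # Power iteration: alternately push u through W^T and W, renormalizing each time.
    for _ in range(n_iters):
        v = W.T @ u
        v = v / (np.linalg.norm(v) + eps)
        u = W @ v
        u = u / (np.linalg.norm(u) + eps)
    sigma = float(u @ (W @ v))     # estimate of the largest singular value sigma(W)
    return W / sigma, u            # Eq. (8), plus the updated u for the next training step

# usage: keep one u per layer; for a conv kernel of shape (d_out, d_in, h, w),
# reshape it to (d_out, d_in*h*w) first, as described in Section 4.
W = np.random.randn(256, 128)
u = np.random.randn(256)
W_sn, u = spectral_normalize(W, u, n_iters=50)   # many iterations only for this demo
print(np.linalg.svd(W_sn, compute_uv=False)[0])  # largest singular value, approximately 1

During training, the estimate sharpens over time because u is reused across updates while W changes only slightly per step, which is why a single iteration per update is enough.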
¹ For example, ReLU (Jarrett et al., 2009; Nair & Hinton, 2010; Glorot et al., 2011) and leaky ReLU (Maas et al., 2013) satisfy this condition, and many popular activation functions satisfy a K-Lipschitz constraint for some predefined K as well.

2.3 GRADIENT ANALYSIS OF THE SPECTRALLY NORMALIZED WEIGHTS

The gradient² of $\bar{W}_{SN}(W)$ with respect to $W_{ij}$ is:

$\frac{\partial \bar{W}_{SN}(W)}{\partial W_{ij}} = \frac{1}{\sigma(W)} E_{ij} - \frac{1}{\sigma(W)^2} \frac{\partial \sigma(W)}{\partial W_{ij}} W = \frac{1}{\sigma(W)} E_{ij} - \frac{[u_1 v_1^{\mathrm{T}}]_{ij}}{\sigma(W)^2} W$   (9)

$= \frac{1}{\sigma(W)} \left( E_{ij} - [u_1 v_1^{\mathrm{T}}]_{ij}\, \bar{W}_{SN} \right),$   (10)

where $E_{ij}$ is the matrix whose $(i,j)$-th entry is 1 and zero everywhere else, and $u_1$ and $v_1$ are respectively the first left and right singular vectors of $W$. If $h$ is the hidden layer in the network to be transformed by $\bar{W}_{SN}$, the derivative of the $V(G,D)$ calculated over the mini-batch with respect to $W$ of the discriminator $D$ is given by:

$\frac{\partial V(G,D)}{\partial W} = \frac{1}{\sigma(W)} \left( \hat{E}\!\left[\delta h^{\mathrm{T}}\right] - \left( \hat{E}\!\left[\delta^{\mathrm{T}} \bar{W}_{SN} h\right] \right) u_1 v_1^{\mathrm{T}} \right)$   (11)

$= \frac{1}{\sigma(W)} \left( \hat{E}\!\left[\delta h^{\mathrm{T}}\right] - \lambda\, u_1 v_1^{\mathrm{T}} \right),$   (12)

where $\delta := \left( \partial V(G,D) / \partial (\bar{W}_{SN} h) \right)^{\mathrm{T}}$, $\lambda := \hat{E}\!\left[\delta^{\mathrm{T}} (\bar{W}_{SN} h)\right]$, and $\hat{E}[\cdot]$ represents the empirical expectation over the mini-batch. $\frac{\partial V}{\partial W} = 0$ when $\hat{E}[\delta h^{\mathrm{T}}] = k\, u_1 v_1^{\mathrm{T}}$ for some $k \in \mathbb{R}$.

We would like to comment on the implication of (12). The first term $\hat{E}[\delta h^{\mathrm{T}}]$ is the same as the derivative of the weights without normalization. In this light, the second term in the expression can be seen as a regularization term penalizing the first singular components with the adaptive regularization coefficient $\lambda$. $\lambda$ is positive when $\delta$ and $\bar{W}_{SN} h$ are pointing in similar directions, and this prevents the column space of $W$ from concentrating into one particular direction in the course of the training. In other words, spectral normalization prevents the transformation of each layer from becoming too sensitive in one direction. We can also use spectral normalization to devise a new parametrization for the model. Namely, we can split the layer map into two separate trainable components: the spectrally normalized map and the spectral norm constant. As it turns out, this parametrization has merit on its own and promotes the performance of GANs (see Appendix E).

3 SPECTRAL NORMALIZATION VS OTHER REGULARIZATION TECHNIQUES

The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the $\ell_2$ norm of each row vector in the weight matrix. Mathematically, this is equivalent to requiring the weight-normalized matrix $\bar{W}_{WN}$ to satisfy

$\sigma_1(\bar{W}_{WN})^2 + \sigma_2(\bar{W}_{WN})^2 + \cdots + \sigma_T(\bar{W}_{WN})^2 = d_o, \text{ where } T = \min(d_i, d_o),$   (13)

where $\sigma_t(A)$ is the $t$-th singular value of matrix $A$. Therefore, up to a scalar, this is the same as the Frobenius normalization, which requires the sum of the squared singular values to be 1. These normalizations, however, inadvertently impose a much stronger constraint on the matrix than intended. If $\bar{W}_{WN}$ is the weight-normalized matrix of dimension $d_i \times d_o$, the norm $\|\bar{W}_{WN} h\|_2$ for a fixed unit vector $h$ is maximized at $\|\bar{W}_{WN} h\|_2 = \sqrt{d_o}$ when $\sigma_1(\bar{W}_{WN}) = \sqrt{d_o}$ and $\sigma_t(\bar{W}_{WN}) = 0$ for $t = 2, \dots, T$, which means that $\bar{W}_{WN}$ is of rank one. A similar thing can be said about the Frobenius normalization (see the appendix for more details). Using such a $\bar{W}_{WN}$ corresponds to using only one feature to discriminate the model probability distribution from the target. In order to retain as much of the norm of the input as possible, and hence to make the discriminator more sensitive, one would hope to make the norm of $\bar{W}_{WN} h$ large. For weight normalization, however, this comes at the cost of reducing the rank and hence the number of features to be used by the discriminator.
Thus, there is a conflict of interests between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often reigns over the other in many cases, inadvertently diminishing the number of features used by the discriminators. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at a select few features. Weight clipping (Arjovsky et al., 2017) also suffers from the same pitfall.

² Indeed, when the spectrum has multiplicities, we would be looking at subgradients here. However, the probability of this happening is zero (almost surely), so we continue the discussion without giving consideration to such events.

Our spectral normalization, on the other hand, does not suffer from such a conflict of interest. Note that the Lipschitz constant of a linear operator is determined only by the maximum singular value. In other words, the spectral norm is independent of rank. Thus, unlike weight normalization, our spectral normalization allows the parameter matrix to use as many features as possible while satisfying the local 1-Lipschitz constraint. Our spectral normalization leaves more freedom in choosing the number of singular components (features) to feed to the next layer of the discriminator.

Brock et al. (2016) introduced orthonormal regularization on each weight to stabilize the training of GANs. In their work, Brock et al. (2016) augmented the adversarial objective function by adding the following term:

$\|W^{\mathrm{T}} W - I\|_F^2.$   (14)

While this seems to serve the same purpose as spectral normalization, orthonormal regularization is mathematically quite different from our spectral normalization because orthonormal regularization destroys the information about the spectrum by setting all the singular values to one. Spectral normalization, on the other hand, only scales the spectrum so that its maximum will be one.

Gulrajani et al. (2017) used the gradient penalty method in combination with WGAN. In their work, they placed a K-Lipschitz constraint on the discriminator by augmenting the objective function with a regularizer that rewards the function for having a local 1-Lipschitz constant (i.e. $\|\nabla_{\hat{x}} f\|_2 = 1$) at discrete sets of points of the form $\hat{x} := \epsilon \tilde{x} + (1-\epsilon) x$, generated by interpolating a sample $\tilde{x}$ from the generative distribution and a sample $x$ from the data distribution. While this rather straightforward approach does not suffer from the problems we mentioned above regarding the effective dimension of the feature space, it has the obvious weakness of being heavily dependent on the support of the current generative distribution. As a matter of course, the generative distribution and its support gradually change in the course of the training, and this can destabilize the effect of such regularization. In fact, we empirically observed that a high learning rate can destabilize the performance of WGAN-GP. On the contrary, our spectral normalization regularizes the function on the operator space, and the effect of the regularization is more stable with respect to the choice of the batch. Training with our spectral normalization does not easily destabilize with an aggressive learning rate. Moreover, WGAN-GP requires more computational cost than our spectral normalization with single-step power iteration, because the computation of $\|\nabla_{\hat{x}} f\|_2$ requires one whole round of forward and backward propagation. In the appendix section, we compare the computational cost of the two methods for the same number of updates.
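The contrast with orthonormal regularization can be seen numerically in a small NumPy sketch. This is illustrative only: the penalty (14) is a soft regularizer, so here we compare against the fully orthonormalized matrix U V^T that it pushes towards, whereas spectral normalization simply rescales the matrix by its largest singular value as in (8).

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U, s, Vt = np.linalg.svd(W)

# Spectral normalization (8): the whole spectrum is divided by sigma(W) = s[0],
# so its shape (and hence the rank information) is preserved and only the
# maximum singular value is pinned to 1.
s_sn = np.linalg.svd(W / s[0], compute_uv=False)

# The orthonormal limit that the penalty ||W^T W - I||_F^2 of (14) pushes towards:
# every singular value is forced to 1, destroying the spectral information.
s_orth = np.linalg.svd(U @ Vt, compute_uv=False)

print(s_sn[:4])    # decaying spectrum with maximum exactly 1
print(s_orth[:4])  # all ones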
4 EXPERIMENTS

In order to evaluate the efficacy of our approach and investigate the reason behind its efficacy, we conducted a set of extensive experiments on unsupervised image generation on CIFAR-10 (Torralba et al., 2008) and STL-10 (Coates et al., 2011), and compared our method against other normalization techniques. To see how our method fares on a large dataset, we also applied our method to the ILSVRC2012 dataset (ImageNet) (Russakovsky et al., 2015). This section is structured as follows. First, we will discuss the objective functions we used to train the architectures, and then we will describe the optimization settings we used in the experiments. We will then explain two performance measures used to evaluate the images produced by the trained generators. Finally, we will summarize our results on CIFAR-10, STL-10, and ImageNet.

As for the architecture of the discriminator and generator, we used convolutional neural networks. Also, for the evaluation of the spectral norm of the convolutional weight $W \in \mathbb{R}^{d_{out} \times d_{in} \times h \times w}$, we treated the operator as a 2-D matrix of dimension $d_{out} \times (d_{in} h w)$³. We trained the parameters of the generator with batch normalization (Ioffe & Szegedy, 2015). We refer the readers to Table 3 in the appendix section for more details of the architectures.

³ Note that, since we are conducting the convolution discretely, the spectral norm will depend on the size of the stride and padding. However, the answer will only differ by some predefined K.

For all methods other than WGAN-GP, we used the following standard objective function for the adversarial loss:

$V(G,D) := E_{x \sim q_{data}(x)}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))],$   (15)

where $z \in \mathbb{R}^{d_z}$ is a latent variable, $p(z)$ is the standard normal distribution $N(0, I)$, and $G: \mathbb{R}^{d_z} \to \mathbb{R}^{d_0}$ is a deterministic generator function. We set $d_z$ to 128 for all of our experiments. For the updates of $G$, we used the alternate cost proposed by Goodfellow et al. (2014), $-E_{z \sim p(z)}[\log(D(G(z)))]$, as used in Goodfellow et al. (2014) and Warde-Farley & Bengio (2017). For the updates of $D$, we used the original cost defined in (15). We also tested the performance of the algorithm with the so-called hinge loss, which is given by

$V_D(\hat{G}, D) = E_{x \sim q_{data}(x)}\left[\min(0, -1 + D(x))\right] + E_{z \sim p(z)}\left[\min\left(0, -1 - D(\hat{G}(z))\right)\right]$   (16)

$V_G(G, \hat{D}) = -E_{z \sim p(z)}\left[\hat{D}(G(z))\right],$   (17)

respectively for the discriminator and the generator. Optimizing these objectives is equivalent to minimizing the so-called reverse KL divergence $\mathrm{KL}[p_g \| q_{data}]$. This type of loss has already been proposed and used in Lim & Ye (2017) and Tran et al. (2017). The algorithm based on the hinge loss also showed good performance when evaluated with inception score and FID. For Wasserstein GANs with gradient penalty (WGAN-GP) (Gulrajani et al., 2017), we used the following objective function: $V(G,D) := E_{x \sim q_{data}}[D(x)] - E_{z \sim p(z)}[D(G(z))] - \lambda E_{\hat{x} \sim p_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$, where the regularization term is the one we introduced in the appendix section D.4.

For quantitative assessment of generated examples, we used inception score (Salimans et al., 2016) and Fréchet inception distance (FID) (Heusel et al., 2017). Please see Appendix B.1 for the details of each score.

4.1 RESULTS ON CIFAR-10 AND STL-10

In this section, we report the accuracy of spectral normalization (we use the abbreviation SN-GAN for the spectrally normalized GANs) during the training, and the dependence of the algorithm's performance on the hyperparameters of the optimizer.
We also compare the performance quality of the algorithm against those of other regularization/normalization techniques for the discriminator networks, including: weight clipping (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), batch normalization (BN) (Ioffe & Szegedy, 2015), layer normalization (LN) (Ba et al., 2016), weight normalization (WN) (Salimans & Kingma, 2016) and orthonormal regularization (orthonormal) (Brock et al., 2016). In order to evaluate the stand-alone efficacy of the gradient penalty, we also applied the gradient penalty term to the standard adversarial loss of GANs (15); we refer to this method as 'GAN-GP'. For weight clipping, we followed the original work of Arjovsky et al. (2017) and set the clipping constant c at 0.01 for the convolutional weight of each layer. For gradient penalty, we set λ to 10, as suggested in Gulrajani et al. (2017). For orthonormal, we initialized each weight of D with a randomly selected orthonormal operator and trained GANs with the objective function augmented with the regularization term used in Brock et al. (2016). For all comparative studies throughout, we excluded the multiplier parameter γ in the weight normalization method, as well as in the batch normalization and layer normalization methods. This was done in order to prevent the methods from overtly violating the Lipschitz condition. When we experimented with different multiplier parameters, we were in fact not able to achieve any improvement.

For optimization, we used the Adam optimizer (Kingma & Ba, 2015) in all of our experiments. We tested 6 settings for (1) $n_{dis}$, the number of updates of the discriminator per one update of the generator, and (2) the learning rate α and the first and second order momentum parameters (β1, β2) of Adam. We list the details of these settings in Table 1. Out of these 6 settings, A, B, and C are the settings used in previous representative works. The purpose of settings D, E, and F is to evaluate the performance of the algorithms with more aggressive learning rates. For the details of the architectures of the convolutional networks deployed in the generator and the discriminator, we refer the reader to Table 3 in the appendix section. The number of updates of the GAN generator was 100K for all experiments, unless otherwise noted.

Table 1: Hyper-parameter settings we tested in our experiments. †, ‡ and ⋆ are the hyper-parameter settings following Gulrajani et al. (2017), Warde-Farley & Bengio (2017) and Radford et al. (2016), respectively.

Setting   α        β1    β2     n_dis
A†        0.0001   0.5   0.9    5
B‡        0.0001   0.5   0.999  1
C⋆        0.0002   0.5   0.999  1
D         0.001    0.5   0.9    5
E         0.001    0.5   0.999  5
F         0.001    0.9   0.999  5

Firstly, we inspected the spectral norm of each layer during the training to make sure that our spectral normalization procedure is indeed serving its purpose. As we can see in Figure 9 in Appendix C.1, the spectral norms of these layers float around the 1–1.05 region throughout the training. Please see Appendix C.1 for more details.

Figure 1: Inception scores on (a) CIFAR-10 and (b) STL-10 with different methods and hyperparameters (higher is better).

In Figures 1 and 2 we show the inception scores of each method with settings A–F. We can see that spectral normalization is relatively robust to aggressive learning rates and momentum parameters. WGAN-GP fails to train good GANs at high learning rates and high momentum parameters on both CIFAR-10 and STL-10. Orthonormal regularization performed poorly for setting E on STL-10, but performed slightly better than our method with the optimal setting.
These results suggest that our method is more robust than the other methods with respect to changes in the training settings. Also, the optimal performance of weight normalization was inferior to both WGAN-GP and spectral normalization on STL-10, which consists of more diverse examples than CIFAR-10. The best scores of spectral normalization are better than those of almost all other methods on both CIFAR-10 and STL-10.

Figure 2: FIDs on (a) CIFAR-10 and (b) STL-10 with different methods and hyperparameters (lower is better).

In Table 2, we show the inception scores of the different methods with optimal settings on the CIFAR-10 and STL-10 datasets. We see that SN-GANs performed better than almost all contemporaries on the optimal settings. SN-GANs performed even better with the hinge loss (17).⁴ For the training with the same number of iterations, SN-GANs fell behind orthonormal regularization for STL-10. For a more detailed comparison between orthonormal regularization and spectral normalization, please see Section 4.1.2.

⁴ As for STL-10, we also ran SN-GANs over twice longer iterations because it did not seem to converge. Yet this elongated training sequence still completes before WGAN-GP with the original iteration size, because the optimal setting of SN-GANs (setting B, n_dis = 1) is computationally light.

Table 2: Inception scores and FIDs with unsupervised image generation on CIFAR-10 and STL-10. † (Radford et al., 2016) (experimented by Yang et al. (2017)), ‡ (Yang et al., 2017), ∗ (Warde-Farley & Bengio, 2017), †† (Gulrajani et al., 2017).

Method                                  | Inception score (CIFAR-10) | Inception score (STL-10) | FID (CIFAR-10) | FID (STL-10)
Real data                               | 11.24±.12                  | 26.08±.26                | 7.8            | 7.9
-- Standard CNN --
Weight clipping                         | 6.41±.11                   | 7.57±.10                 | 42.6           | 64.2
GAN-GP                                  | 6.93±.08                   | –                        | 37.7           | –
WGAN-GP                                 | 6.68±.06                   | 8.42±.13                 | 40.2           | 55.1
Batch Norm.                             | 6.27±.10                   | –                        | 56.3           | –
Layer Norm.                             | 7.19±.12                   | 7.61±.12                 | 33.9           | 75.6
Weight Norm.                            | 6.84±.07                   | 7.16±.10                 | 34.7           | 73.4
Orthonormal                             | 7.40±.12                   | 8.56±.07                 | 29.0           | 46.7
(ours) SN-GANs                          | 7.42±.08                   | 8.28±.09                 | 29.3           | 53.1
Orthonormal (2x updates)                | –                          | 8.67±.08                 | –              | 44.2
(ours) SN-GANs (2x updates)             | –                          | 8.69±.09                 | –              | 47.5
(ours) SN-GANs, Eq. (17)                | 7.58±.12                   | –                        | 25.5           | –
(ours) SN-GANs, Eq. (17) (2x updates)   | –                          | 8.79±.14                 | –              | 43.2
-- ResNet --⁵
Orthonormal, Eq. (17)                   | 7.92±.04                   | 8.72±.06                 | 23.8±.58       | 42.4±.99
(ours) SN-GANs, Eq. (17)                | 8.22±.05                   | 9.10±.04                 | 21.7±.21       | 40.1±.50
DCGAN†                                  | 6.64±.14                   | 7.84±.07                 | –              | –
LR-GANs‡                                | 7.17±.07                   | –                        | –              | –
Warde-Farley et al.∗                    | 7.72±.13                   | 8.51±.13                 | –              | –
WGAN-GP (ResNet)††                      | 7.86±.08                   | –                        | –              | –

In Figure 6 we show the images produced by the generators trained with WGAN-GP, weight normalization, and spectral normalization. SN-GANs were consistently better than GANs with weight normalization in terms of the quality of the generated images. To be more precise, as we mentioned in Section 3, the set of images generated with spectral normalization is clearer and more diverse than the images produced with weight normalization. We can also see that WGAN-GP failed to train good GANs with high learning rates and high momentums (D, E and F). The generated images with GAN-GP, batch normalization, and layer normalization are shown in Figure 12 in the appendix section.

We also compared our algorithm against multiple benchmark methods and summarized the results in the bottom half of Table 2. We also tested the performance of our method on the ResNet-based GANs used in Gulrajani et al. (2017). Please note that the methods listed there all differ in both the optimization method and the architecture of the model. Please see Tables 4 and 5 in the appendix section for the detailed network architectures.
Our implementation of the algorithm was able to perform better than almost all of these predecessors.

⁵ For our ResNet experiments, we trained the same architecture with multiple random seeds for weight initialization and produced models with different parameters. We then generated 5000 images 10 times and computed the average inception score for each model. The values for ResNet in the table are the mean and standard deviation of the score computed over the set of models trained with different seeds.

Figure 3: Squared singular values of weight matrices (layers 1–7) trained with different methods on (a) CIFAR-10 and (b) STL-10: weight clipping (WC), weight normalization (WN) and spectral normalization (SN). We scaled the singular values so that the largest singular value is equal to 1. For WN and SN, we calculated the singular values of the normalized weight matrices.

4.1.1 ANALYSIS OF SN-GANS

Singular values analysis on the weights of the discriminator D. In Figure 3, we show the squared singular values of the weight matrices in the final discriminator D produced by each method, using the parameters that yielded the best inception score. As we predicted in Section 3, the singular values of the first to fifth layers trained with weight clipping and weight normalization concentrate on a few components. That is, the weight matrices of these layers tend to be rank deficient. On the other hand, the singular values of the weight matrices in those layers trained with spectral normalization are more broadly distributed. When the goal is to distinguish a pair of probability distributions on a low-dimensional nonlinear data manifold embedded in a high dimensional space, rank deficiencies in lower layers can be especially fatal. Outputs of lower layers have gone through only a few sets of rectified linear transformations, which means that they tend to lie on a space that is linear in most parts. Marginalizing out many features of the input distribution in such a space can result in an oversimplified discriminator. We can actually confirm the effect of this phenomenon on the generated images, especially in Figure 6b: the images generated with spectral normalization are more diverse and complex than those generated with weight normalization.

Training time. On CIFAR-10, SN-GANs are slightly slower than weight normalization (about 110–120% computational time), but significantly faster than WGAN-GP. As we mentioned in Section 3, WGAN-GP is slower than the other methods because WGAN-GP needs to calculate the gradient of the gradient norm $\|\nabla_x D\|_2$. For STL-10, the computational time of SN-GANs is almost the same as that of vanilla GANs, because the relative computational cost of the power iteration (18) is negligible compared to the cost of forward and backward propagation (the image size of STL-10, 48×48, is larger than that of CIFAR-10). Please see Figure 10 in the appendix section for the actual computational time.

4.1.2 COMPARISON BETWEEN SN-GANS AND ORTHONORMAL REGULARIZATION

In order to highlight the difference between our spectral normalization and orthonormal regularization, we conducted an additional set of experiments.
As we explained in Section 3, orthonormal regularization is different from our method in that it destroys the spectral information and puts equal emphasis on all feature dimensions, including the ones that 'shall' be weeded out in the training process. To see the extent of its possibly detrimental effect, we experimented by increasing the dimension of the feature space⁶, especially at the final layer (7th conv), for which the training with our spectral normalization prefers a relatively small feature space (dimension < 100; see Figure 3b). As for the setting of the training, we selected the parameters for which the orthonormal regularization performed optimally. Figure 4 shows the result of our experiments. As we predicted, the performance of the orthonormal regularization deteriorates as we increase the dimension of the feature maps at the final layer. Our SN-GANs, on the other hand, do not falter with this modification of the architecture. Thus, at least in this perspective, we may say that our method is more robust with respect to changes of the network architecture.

Figure 4: The effect on the performance on STL-10 induced by the change of the feature map dimension of the final layer. The width of the highlighted region represents the standard deviation of the results over multiple seeds of weight initialization. The orthonormal regularization does not perform well with large feature map dimensions, possibly because of its design that forces the discriminator to use all dimensions, including the ones that are unnecessary. For the setting of the optimizers' hyper-parameters, we used setting C, which was optimal for orthonormal regularization.

Figure 5: Learning curves for conditional image generation in terms of inception score for SN-GANs and GANs with orthonormal regularization on ImageNet.

⁶ More precisely, we simply increased the input dimension and the output dimension by the same factor. In Figure 4, 'relative size' = 1.0 implies that the layer structure is the same as the original.

4.2 IMAGE GENERATION ON IMAGENET

To show that our method remains effective on a large high-dimensional dataset, we also applied our method to the training of conditional GANs on the ILSVRC2012 dataset with 1000 classes, each consisting of approximately 1300 images, which we compressed to 128×128 pixels. Regarding the adversarial loss for conditional GANs, we used practically the same formulation used in Mirza & Osindero (2014), except that we replaced the standard GANs loss with the hinge loss (17). Please see Appendix B.3 for the details of the experimental settings.
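For reference, the hinge objectives (16) and (17) can be written as losses to be minimized in a few lines. This is a minimal framework-independent NumPy sketch of the objectives only; the function names are illustrative and d_real, d_fake stand for the (unbounded) discriminator outputs on real and generated batches.

import numpy as np

def hinge_loss_D(d_real, d_fake):
    # Negative of V_D in Eq. (16): the discriminator maximizes V_D, which is the
    # same as minimizing E[max(0, 1 - D(x))] + E[max(0, 1 + D(G(z)))].
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake))

def hinge_loss_G(d_fake):
    # V_G in Eq. (17), minimized by the generator: -E[D(G(z))].
    return -np.mean(d_fake)

# toy usage with hand-picked discriminator outputs
d_real = np.array([0.8, 1.5, -0.2])
d_fake = np.array([-1.2, 0.3, -0.9])
print(hinge_loss_D(d_real, d_fake), hinge_loss_G(d_fake))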
