Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities

Guo-Jun Qi, Member, IEEE

Abstract—In this paper, we present a novel Loss-Sensitive GAN (LS-GAN) that learns a loss function to separate generated samples from their real examples. An important property of the LS-GAN is that it allows the generator to focus on improving poor data points that are far apart from real examples rather than wasting efforts on those samples that have already been well generated, and thus can improve the overall quality of generated samples. The theoretical analysis also shows that the LS-GAN can generate samples following the true data density. In particular, we present a regularity condition on the underlying data density, which allows us to use a class of Lipschitz losses and generators to model the LS-GAN. It relaxes the assumption that the classic GAN should have infinite modeling capacity to obtain a similar theoretical guarantee. Furthermore, we show the generalization ability of the LS-GAN by bounding the difference between the model performances over the empirical and real distributions, as well as deriving a tractable sample complexity to train the LS-GAN model in terms of its generalization ability. We also derive a non-parametric solution that characterizes the upper and lower bounds of the losses learned by the LS-GAN, both of which are cone-shaped and have non-vanishing gradient almost everywhere. This shows there will be sufficient gradient to update the generator of the LS-GAN even if the loss function is over-trained, relieving the vanishing gradient problem in the classic GAN. We also extend the unsupervised LS-GAN to a conditional model generating samples based on given conditions, and show its applications in both supervised and semi-supervised learning problems. We conduct experiments to compare different models on both generation and classification tasks, and show the LS-GAN is resilient against vanishing gradient and model collapse even with an overtrained loss function or a mildly changed network architecture.

Index Terms—Loss-Sensitive GAN, Lipschitz density, image generation and classification

1 INTRODUCTION

A classic Generative Adversarial Net (GAN) [1] learns a discriminator and a generator simultaneously by playing a two-player minimax game to generate samples following the underlying data density. For this purpose, the discriminator is trained to distinguish real samples from those generated by the generator, which in turn guides the generator to produce samples that can make the discriminator believe they are real.

The family of GAN models has demonstrated very impressive performance on synthesizing a wide range of structured data, as diverse as images [2], videos [3], music [4] and even poems [5]. Take image synthesis as an example. On one hand, the discriminator seeks to learn the probability of a sample being a photo-realistic image.

between real and unreal samples. It would in turn limit the ability of learning a better generator that relies on an exact discriminator to capture the difference between real and unreal examples.

In addition, from a theoretical perspective, the analysis behind the GAN makes a non-parametric assumption that the model has infinite modeling capacity [1] in order to prove that the density of generated samples matches the underlying data density we wish to estimate. This is too strong an assumption to hold: even a very deep network cannot assume infinite modeling capacity to map any given input to an arbitrarily desired output. Even worse, a generative model with unlimited capacity would also be the cause of
It vanishinggradientintheclassicGAN.Asprovedin[6],the treats natural image examples as positive examples, while discriminator with infinite ability of separating real from the images produced by a paired generator as negative ex- generated samples will lead to a constant Jensen-Shannon amples. Meanwhile, the generator aims to produce images (JS)divergence betweenthe generateddensityandthetrue that can make the discriminator believe they are real. A datadensityiftheirsupportshavenoornegligibleoverlap. minimaxoptimizationproblemissolvedtojointlyoptimize Thiscausesthevanishinggradientthatmakesitimpossible thediscriminatorandthegenerator. to update the generator as the discriminator is quickly Here, a dyadic treatment of real and generated data as trainedtowardsitsoptimality. positiveandnegativeexamplesmayoversimplifytheprob- Moreover,theGANwithinfinitemodelingabilitycould lem of learning a GAN model. Actually, as the generator also suffer from severe overfitting problem, and this is improves, its generated samples would become more and probably the reason for a collapsed generator that is stuck more closer to the manifold of real examples; however, inproducingthesamedatapointsincesuchamodelwould they are still being treated as negative examples to train be powerful enough to aggressively push its generated the discriminator in the classicGAN. Thiscould lead toan samplestothedensestmodeoftheunderlyingdensity.The over-pessimisticdiscriminator characterizing the boundary phenomenon has been observed in literature [2], [7], and a properlyregularizedlearningobjectiveispreferredtoavoid • G.-J. Qi was with the Department of Computer Science, University of themodecollapse. CentralFlorida,Orlando,FL,32816. Inthispaper,weattempttodeveloptheoryandmodels E-mail:[email protected] withoutassuminginfinitemodelingability,whichyieldsthe PREPRINT 2 proposed Loss-Sensitive GAN (LS-GAN). Specifically, we contributions. We will present the proposed LS-GAN in introducealossfunctiontoquantifythequalityofgenerated Section3.InSection4, wewillanalyzetheLS-GAN, show- samples.ThenaconstraintisimposedtotraintheLS-GAN ing that its generated samples follow the underlying data so that the loss of a real sample should be smaller than densityevenwithaclassofLipschitzlossesandgenerators. that of a generated counterpart by an unfixed margin that We will discuss the algorithm details in Section 5, as well depends on how close they are to each other in a metric as analyze its generalizability and sample complexity in space.Inthisway,ifageneratedsampleisalreadyveryclose training loss and generator functions. We will make a non- to a real example, the margin between their losses could parametric analysis of the algorithm in Section 6, followed vanish. This allows the model to focus on improving poor by a comparison with WassersteinGAN in Section 7. Then samples rather than wasting efforts on those samples that we will show how the model can be extended to a condi- have alreadybeen well generated with satisfactoryquality, tional model for supervised and semi-supervised learning therebyimprovingoverallqualityofgenerationresults. problems in Section 8. Experiment results are presented in We also develop a new theory to analyze the proposed Section9,andweconcludeinSection10. LS-GAN on Lipschitz densities. We note that the reason of havingtoassume“infinitemodelingcapacity”intheclassic 2 RELATED WORK AND OUR CONTRIBUTIONS GANisduetoitsambitiousgoaltomodelanarbitrarydata density without imposing any biases. 
However, a general It has bee a long-term goal to enable synthesis of highly principle in learning theory, no free lunch theorem [8], structureddatasuchasimagesandvideos. prefers “biased learning approaches” with suitable priors Deep generative models, especially the Generative Ad- on the underlying data distribution. This prompts us to versarial Net (GAN) [1], have attracted many attentions focus on a specific class of Lipschitz densities to model recently due to their demonstrated abilities of generating the underlying data. It contains a large family of real- real samples following the underlying data densities. In worlddistributions,wherethedatadensitydoesnotchange particular, the GAN attempts to learn a pair of discrimi- abruptly over data points that are close to one another. nator and generator by playing a maximin game to seek By defining the Lipschitz densities based on the distance an equilibrium, in which the discriminator is trained by metric specifying the loss margin, we prove the resulting distinguishing real samples from generated ones and the data density learned by the LS-GAN exactly matches the generatorisoptimizedtoproducesamplesthatcanfoolthe true data density even if the model is limited to the space discriminator. ofLipschitz-continuousfunctions.Thisisanontrivialrelax- A family of GAN architectures have been proposed to ationoftheinfinitemodelingabilityintheclassicGANfrom implement this idea. For example, recent progresses [2], thetheoreticpointofview. [7] have shown impressive performances on synthesizing More importantly, by limiting the LS-GAN to this Lip- photo-realisticimagesbyconstructingmultiplestridedand schitz space, we can prove the loss function and generator factional-stridedconvolutionallayersfordiscriminatorsand that are learned from an empirical distribution with finite generators. On the contrary, [9] proposed to use a Lapla- examples can well generalize to produce data points from cianpyramidtoproducehigh-qualityimagesbyiteratively therealdistribution.Wewillshowthisgeneralizationability addingmultiplelayersofnoisesatdifferentresolutions.[10] by deriving a tractable sample complexity to bound the presented to train a recurrent generative model by using difference of model performances over empirical and real adversarial training to unroll gradient-based optimizations distributions.We furtherpresenta non-parametricsolution tocreatehighqualityimages. to the LS-GAN. It does not rely on any parametricform of In addition to designing different GAN networks, re- functions, thereby characterizing the optimal loss function searcheffortshavebeenmadetotraintheGANbydifferent achievable in the whole space of Lipschtiz functions. This criteria. For example,[11] presentedanenergy-basedGAN non-parametric solution gives both the upper and lower by minimizing an energy function to learn an optimal dis- bounds of the optimal solution, which have non-vanishing criminator,andanauto-encoderstructureddiscriminatoris gradient. Thissuggeststhat the LS-GAN canprovide suffi- presented to compute the energy. The authors also present cient gradient to update its LS-GAN generator even if the a theoretical analysis by showing this variant of GAN can loss function has been fully optimized, thus avoiding the generatesampleswhosedensitycanrecovertheunderlying vanishinggradientproblemthatcouldoccurintrainingthe true data density. However, it still needs to assume the GANmodels. 
model has infinite modeling capacity to prove the result in a non-parametric fashion. This is probably due to the use of a fixed margin to separate generated samples from training examples. This is in contrast to the use of a distance metric in the proposed LS-GAN to specify data-dependent margins under the Lipschitz density assumption. In addition, [12] presented an analysis of the GAN from an information-theoretical perspective; they seek to minimize the variational estimate of f-divergence, and show that the classic GAN is included as a special case of f-GAN. In contrast, InfoGAN [13] proposed another information-theoretic GAN to learn disentangled representations capturing various latent concepts and factors in generating samples. Most recently, [14] proposed to minimize the Earth-Mover distance between the density of generated samples and the true data density, and they show the resultant Wasserstein GAN (WGAN) can address the vanishing gradient problem that the classic GAN suffers from.

Moreover, we generalize the model to a Conditional LS-GAN (CLS-GAN) that can generate samples based on given conditions. In particular, considering different classes as generative conditions, the learned loss function can be used as a classifier for both supervised and semi-supervised learning. The advantage of such a classifier lies in its intrinsic ability of exploring generated examples to reveal unseen variations for different classes. Experiment results demonstrate competitive performance of the CLS-GAN classifier as compared with the state-of-the-art models.

The remainder of this paper is organized as follows. Section 2 reviews the related work and summarizes our

Besides the class of GANs, there exist other models that also attempt to generate natural images. For example, [15] rendered images by matching features in a convolutional network with respect to reference images. [16] used a deconvolutional network to render 3D chair models in various styles and viewpoints. [17] introduced a deep recurrent neural network architecture for image generation with a sequence of variational auto-encoders to iteratively construct complex images.

Fig. 1. Illustration of the idea behind LS-GAN. A margin is enforced to separate real samples from generated counterparts. The margin is not fixed to a constant. Instead it is data-dependent, and could vanish as the generator improves to produce better and better samples. We assume the density of real samples is Lipschitz so as to prove the theoretical results.

We will prove the LS-GAN can reveal the true density even with the limited modeling ability of a bounded Lipschitz constant on the generators and loss functions. This is a nontrivial relaxation of the non-parametric assumption on the classic GAN from both theoretic and practical perspectives. Moreover, we also characterize the optimal loss function by deriving its lower and upper bounds, and show they

Recent efforts have also been made on leveraging the representations learned by deep generative networks to improve classification accuracy when it is too difficult or expensive to label sufficient training examples. For example, [18] presented variational auto-encoders [19] by combining deep generative models and approximate variational inference to explore both labeled and unlabeled data. [2] treated the samples from the GAN generator as a new class, and explored unlabeled examples by assigning them to a class different from the new one.
[20] proposedto traina ladder arecone-shapedandhavenon-vanishinggradiental- network[21]byminimizingthesumofsupervisedandun- mosteverywhere.Theoptimallossfunctionderived supervisedcostfunctionsthroughback-propagation,which between these two bounds are unlikely to saturate avoids the conventional layer-wise pre-training approach. withvanishinggradient,andthuscanprovidesuffi- [22]presentedanapproachtolearningadiscriminativeclas- cientgradienttocontinuouslytrainthegenerator. sifier by trading-off mutual information between observed examplesandtheirpredictedclassesagainstanadversarial generativemodel.[23]soughttojointlydistinguishbetween not only real and generated samples but also their latent 3 LOSS-SENSITIVE GAN variables in an adversarial process. These methods have In the proposed LS-GAN, we abandon to learn a discrimi- shownpromisingresultsforclassificationtasksbyleverag- natorthatusesaprobabilitytocharacterizethelikelihoodof ingdeepgenerativemodels. Inthispaper,weseektodevelopmodelsandalgorithms realsamples.Instead,weintroducealossfunctionLθ(x)to distinguish real and generated samples by the assumption thatareboththeoreticallysoundandpracticallycompetitive that a real example should have a smaller loss than a for data generation and classification tasks. Our contribu- generatedsample. tionsaresummarizedbelow. Formally,considerarealexamplexandageneratedone • We propose a Loss-Sensitive GAN (LS-GAN) model Gφ(z)withz∼Pz(z).Thelossfunctioncanbetrainedwith toproducehigh-qualitysamples.TheLS-GANlearns thefollowingconstraint: a loss function to quantify the quality of generated samples.Thelossofarealexampleshouldbesmaller Lθ(x)≤Lθ(Gφ(z))−∆(x,Gφ(z)) (1) than that of a generated sample by a margin char- acterized by their distance in a metric space. The where ∆(x,Gφ(z)) measuresthe difference betweenx and wellgeneratedsamplesclosetorealexamplesdonot Gφ(z). This constraint requires a real sample be separated needtotreatedasnegativeexamplesanymoresothat from a generated counterpart in terms of their losses by at moreeffortscanbefocusedonimprovingthequality leastamarginof∆(x,Gφ(z)).Figure1illustratesthisidea. ofpoorsamples. Itisnoteworthythatthemarginisnotfixedtoaconstant. • WealsogeneralizeLS-GAN toa conditionalversion Instead, it is data-dependent, which could vanish as the thatsharesthesametheoreticalmeritastheLS-GAN generatoris graduallyimprovedtoproducebettersamples butcangeneratesamplesalignedwithdesignedcon- as they become closer to real examples. For example, one ditions. Specifically, we consider to specify sample canchoosetheℓp-distancekx−Gφ(z)kpasthemargin.This classes as conditions, and this model can produce allows the model to focus on improving the poor samples multiple classes of examples that capture intra-class stillfarawayfromrealexamplesratherthanwastingefforts variations. This yields a classifier using the learned on those that are already well generated. In the theoretical lossfunctionandexploringthegeneratedsamplesto analysis,suchadata-dependentmarginwillalsobeusedto improveclassificationaccuracy. specify a Lipschitz condition, which plays a critical role in • We develop a new theory that introduces Lipschitz guaranteeinggeneratedsamplesfollowtheunderlyingdata regularity to characterize underlying data densities. density. 
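To make the data-dependent margin concrete, the following is a minimal sketch of how the constraint in (1) can be evaluated on a mini-batch. It assumes a PyTorch-style loss network `loss_net` that returns one nonnegative scalar loss per sample and a generator `gen`; both names are illustrative rather than taken from any released code, and the ℓ1 distance is the choice of ∆ used in the paper's experiments.

```python
import torch

def margin(x_real, x_gen):
    # Data-dependent margin Δ(x, G(z)): per-pixel L1 distance between a real
    # sample and its paired generated sample.
    return (x_real - x_gen).abs().flatten(start_dim=1).mean(dim=1)

def constraint_violation(loss_net, x_real, x_gen):
    # Amount by which a pair violates constraint (1):
    #   L(x) <= L(G(z)) - Δ(x, G(z)),
    # i.e. the positive part of Δ(x, G(z)) + L(x) - L(G(z)). It is zero for pairs
    # whose generated sample is already separated from the real one by the margin.
    return torch.relu(margin(x_real, x_gen) + loss_net(x_real) - loss_net(x_gen))
```

Pairs whose generated samples are already close to their real counterparts contribute little here, which is exactly the mechanism that lets the model concentrate its capacity on poorly generated samples.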
Now let us relax the above hard constraint by introducing a slack variable ξx,z:

Lθ(x) − ξx,z ≤ Lθ(Gφ(z)) − ∆(x, Gφ(z))    (2)
ξx,z ≥ 0    (3)

where the slack variable would be nonzero when a violation of the constraint occurs.

Therefore, with a fixed generator Gφ, the loss function parameterized with θ can be trained by

min_θ  E_{x∼Pdata(x)}[Lθ(x)] + λ E_{x∼Pdata(x), z∼Pz(z)}[ξx,z]    (4)
s.t.  Lθ(x) − ξx,z ≤ Lθ(Gφ(z)) − ∆(x, Gφ(z)),  ξx,z ≥ 0

where λ is a positive balancing parameter, and Pdata(x) is the data distribution for real samples. The first term minimizes the expected loss function over the data distribution, since a smaller value of the loss function is preferred on real samples. The second term is the expected error caused by the violation of the constraint. Without loss of generality, we require the loss function to be nonnegative. Later we will show that this nonnegativity requirement can be dropped in some cases.

On the other hand, for a fixed loss function Lθ∗, one can solve the following minimization problem to find an optimal generator:

min_φ  E_{z∼Pz(z)}[Lθ∗(Gφ(z))]    (5)

In summary, Lθ and Gφ are alternately optimized by solving an equilibrium (θ∗, φ∗) such that θ∗ minimizes

S(θ, φ∗) = E_{x∼Pdata(x)}[Lθ(x)] + λ E_{x∼Pdata(x), zG∼PG∗(zG)}[(∆(x, zG) + Lθ(x) − Lθ(zG))_+]    (6)

which is an equivalent compact form of (4) with (a)_+ = max(a, 0), and φ∗ minimizes

T(θ∗, φ) = E_{zG∼PG(zG)}[Lθ∗(zG)]    (7)

where PG(zG) is the density of samples generated by Gφ(z).

4 THEORETICAL ANALYSIS

Suppose (θ∗, φ∗) is a Nash equilibrium that jointly solves (6) and (7). We will show that as λ → +∞, the density PG∗ of the samples generated by Gφ∗ will converge to the underlying data density Pdata.

To prove this result, we need the following definition.

Definition. For any two samples x and z, the loss function F(x) is Lipschitz continuous with respect to a distance metric ∆ if

|F(x) − F(z)| ≤ κ ∆(x, z)

with a bounded Lipschitz constant κ, i.e., κ < +∞.

To prove our main result, we assume the following regularity condition on the underlying data density.

Assumption 1. The data density Pdata is supported in a compact set, and it is Lipschitz continuous.

The set of Lipschitz densities on a compact support contains a large family of distributions that are dense in the space of continuous densities. For example, the density of natural images can be considered Lipschitz continuous, as the densities of two similar images in a neighborhood are unlikely to change abruptly. Moreover, one can restrict the image density to a compact support since an image has bounded pixel values in [0, 255].

This is contrary to the analysis of the classic GAN, where one must assume both the discriminator and the generator have infinite modeling ability to prove that PG∗ equals Pdata. The Lipschitz assumption on the data density allows us to relax such a strong assumption to a Lipschitz loss function Lθ and generator density PG. This results in the following lemma relating PG∗ to Pdata.

Lemma 1. Under Assumption 1, given a Nash equilibrium (θ∗, φ∗) such that PG∗ is Lipschitz continuous, we have

∫_x |Pdata(x) − PG∗(x)| dx ≤ 2/λ

Thus, PG∗(x) converges to Pdata(x) as λ → +∞.

The proof of this lemma is given in the appendix.

Remark 1. From this theorem, we find that by allowing λ to be infinitely large, the learned density PG∗(x) should exactly match the data density Pdata(x). In other words, we can simply disregard the first loss minimization term in (6), as it plays no role as λ → +∞. It is also not hard to see that if we disregard the first minimization term, the requirement that the loss function Lθ be nonnegative is no longer needed to prove the above theorem, and this gives us more flexibility in designing the loss function for the LS-GAN.

However, the reason that we still keep the loss minimization term in the formulation will become clear after we develop the conditional LS-GAN later.

Now we can show the existence of a Nash equilibrium such that both the loss function Lθ and the density PG(zG) of generated samples are Lipschitz.

Let Fκ be the class of functions with a bounded Lipschitz constant κ. It is not difficult to show that the space Fκ is convex and compact if these functions are supported in a compact set¹.

1. For example, the space of natural images is compact as their pixel values are restricted to a compact range of [0, 255].
In addition, both S(θ, φ) and T(θ, φ) are convex functions in Lθ and in PG(zG). These guarantee the existence of a Nash equilibrium (θ∗, φ∗) with both Lθ∗ and PG∗ in Fκ, following the proof of classic mixed-strategy game theory by applying the Kakutani fixed-point theorem [24]. Thus, we have the following lemma.

Lemma 2. Under Assumption 1, there exists a Nash equilibrium (θ∗, φ∗) such that Lθ∗ and PG∗ are Lipschitz.

Putting the above two lemmas together, we have the following theorem.

Theorem 1. Under Assumption 1, a Nash equilibrium (θ∗, φ∗) exists such that

(i) Lθ∗ and PG∗ are Lipschitz;

(ii) ∫_x |Pdata(x) − PG∗(x)| dx ≤ 2/λ → 0, as λ → +∞;

(iii) Pdata(x) ≥ (λ/(1+λ)) PG∗(x).

5 ALGORITHM AND ITS GENERALIZATION ABILITY

The minimization problems (6) and (7) can be rewritten by replacing the expectations with a given set of examples Xm = {x1, ..., xm} and the noise vectors Zm = {z1, ..., zm} drawn from a distribution Pz(z).

This results in the following two problems:

min_θ  Sm(θ, φ∗) ≜ (1/m) Σ_{i=1}^{m} Lθ(xi) + (λ/m) Σ_{i=1}^{m} (∆(xi, Gφ∗(zi)) + Lθ(xi) − Lθ(Gφ∗(zi)))_+    (8)

and

min_φ  Tk(θ∗, φ) = (1/k) Σ_{i=1}^{k} Lθ∗(Gφ(z′i))    (9)

The random vectors Z′k = {z′i | i = 1, ..., k} used in (9) can be different from Zm used in (8).

The loss function and the generator can be learned by alternating between these two problems over mini-batches.
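The two mini-batch problems can be written directly as differentiable objectives. The following is a hedged PyTorch rendering of (8) and (9), not the author's released implementation; `loss_net`, `gen` and the ℓ1 margin are the same illustrative assumptions as in the earlier sketch, and the weight decay on the loss network reflects the paper's way of keeping its Lipschitz constant bounded (the specific decay value here is only illustrative).

```python
import torch

def loss_objective(loss_net, gen, x_real, z, lam):
    """Empirical objective S_m in (8), minimized over the loss-network parameters θ."""
    with torch.no_grad():                         # the generator G_{φ*} is held fixed in this step
        x_gen = gen(z)
    delta = (x_real - x_gen).abs().flatten(start_dim=1).mean(dim=1)   # Δ(x_i, G_{φ*}(z_i))
    l_real, l_gen = loss_net(x_real), loss_net(x_gen)
    hinge = torch.relu(delta + l_real - l_gen)                        # the (·)_+ term of (8)
    return l_real.mean() + lam * hinge.mean()

def generator_objective(loss_net, gen, z_prime):
    """Empirical objective T_k in (9), minimized over the generator parameters φ."""
    return loss_net(gen(z_prime)).mean()

# Alternating updates over mini-batches (the structure of Algorithm 1), assuming
# loss_net and gen are nn.Module instances defined elsewhere. Weight decay on the
# loss network discourages large weights, i.e. an overly large Lipschitz constant.
opt_L = torch.optim.Adam(loss_net.parameters(), lr=1e-3, betas=(0.5, 0.999), weight_decay=1e-4)
opt_G = torch.optim.Adam(gen.parameters(), lr=1e-3, betas=(0.5, 0.999))

def train_step(x_real, z, z_prime, lam):
    opt_L.zero_grad()
    loss_objective(loss_net, gen, x_real, z, lam).backward()
    opt_L.step()

    opt_G.zero_grad()
    generator_objective(loss_net, gen, z_prime).backward()
    opt_G.step()
```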
In each mini-batch, a set Zm of random noises is sampled from a prior distribution Pz(z), along with a subset of real samples from the training set Xm. Then, the loss function is updated by descending the gradient of (8), and the generator is updated by minimizing (9) with a set of random vectors Z′k sampled from Pz(z). After the generator Gφ∗ has been updated, the data points x(m+j) = Gφ∗(zj) of generated samples will also be updated. Algorithm 1 summarizes the learning algorithm for the LS-GAN.

5.1 Generalization Ability

We have proved that the density of generated samples is consistent with the real data density in Theorem 1. This consistency is established on the fact that the LS-GAN is trained by computing expectations over the real distributions Pdata and PG in the two adversarial objectives (6) and (7). Unfortunately, in a practical algorithm, these population expectations cannot be calculated directly; instead, they can only be approximated over empirical distributions on a finite set of training examples as in (8) and (9).

This raises the concern about the generalizability of the LS-GAN model. In other words, we wonder, with more training examples available, whether the model trained over the empirical distributions can converge to the oracle model that would be learned from the real distributions. If it generalizes, we also wish to estimate the sample complexity characterizing how many examples are required to sufficiently bound the performance difference between the empirical and oracle models.

To this end, first we need to specify the notion of generalization ability for the LS-GAN. The objectives S(θ, φ∗) and T(θ∗, φ) of the LS-GAN are natural choices: S(θ, φ∗) measures the quality of a loss function in distinguishing between real and generated samples, while T(θ∗, φ) measures to what extent a generator can be trained to minimize the loss function. If the LS-GAN generalizes well, the objectives over the empirical and real distributions should have a smaller difference as more examples are drawn from Pdata and PG.

It is worth noting that Arora et al. [25] have proposed a neural network distance to analyze the generalization ability of the GAN. However, this neural network distance is not a good choice here, as it is not related to the objectives that are used to train the LS-GAN. Thus, generalization ability in terms of this neural network distance does not imply the LS-GAN could also generalize. Instead, a direct generalization analysis is required for the LS-GAN in terms of its own objectives.

Let us consider the generalization in terms of S(θ, φ∗) first. This objective is used to train a loss function Lθ. Thus, it will tell us whether a trained loss function would generalize. Consider the true objective

S = min_θ S(θ, φ∗)

and the empirical objective

Sm = min_θ Sm(θ, φ∗),

given a fixed generator Gφ∗².

We wish to show if and how their difference |Sm − S| would be bounded as the number m of samples grows. If the LS-GAN generalizes, the difference should converge in probability to zero as a moderate number of samples come. Otherwise, if the generalization failed, Sm would have a nonvanishing gap to S, implying the model is over-fitted to the empirical samples and could not generalize to the real distribution. From a practical point of view, this means the model merely memorized the given examples, unable to produce new data points from the real distribution.

To establish the above notion of generalization, we need the following assumption about the space of loss functions and their domain.

Assumption 2. We make the following assumptions to establish the generalization ability of the LS-GAN.

I. The loss function Lθ(x) is κL-Lipschitz in its parameter θ, i.e., |Lθ(x) − Lθ′(x)| ≤ κL‖θ − θ′‖ for any x;

II. We also assume that Lθ(x) is κ-Lipschitz in x, i.e., |Lθ(x) − Lθ(x′)| ≤ κ‖x − x′‖;

III. The distance between two samples is bounded, i.e., |∆(x, x′)| ≤ B∆.

Then we can prove the following theorem.

Theorem 2. Under Assumption 2, with probability 1 − η, we have

|Sm − S| ≤ ε

when the number of samples

m ≥ C N B∆² (κ+1)² log(κL N / (ηε)) / ε²,

where C is a sufficiently large constant, and N is the number of parameters in the loss function.

The proof of this theorem is given in Appendix E.

2. Learning Gφ∗ is due to the other objective function.

Similarly, we can derive the generalizability in terms of the other objective T(θ, φ) used to train the generator function by considering

Tk = min_φ Tk(θ∗, φ)

and

T = min_φ T(θ∗, φ)

over empirical and real distributions.

We also need the following assumptions on the space of generator functions and their domains, which are symmetric to Assumption 2.

Fig. 2. Comparison between two optimal loss functions L̂θ∗ and L̃θ∗ in Fκ for LS-GAN (a one-dimensional example over points x(1), ..., x(4)). They are upper and lower bounds of the class of optimal loss functions Lθ∗ to Problem (8). Both the upper and the

Assumption 3. We assume that

I. The generator function Gφ(z) is ρG-Lipschitz in its parameter φ, i.e., |Gφ(z) − Gφ′(z)| ≤ ρG‖φ − φ′‖ for any z;

II.
Also, we have Gφ(z) is ρ-Lipschitz in z, i.e., |Gφ(z)− lowerboundsarecone-shaped,andhavenon-vanishinggradientalmost Gφ(z′)|≤ρkz−z′k; everywhere.Specifically,inthisone-dimensionalexample,bothbounds III. The samplesz’s drawnfrom Pz are bounded,i.e., kzk ≤ arepiecewiselinear,havingaslopeof±κalmosteverywhere. Bz. Then we can show the following theorem about the Theorem4. ThefollowingfunctionsLθ∗ andLθ∗ bothminimize generalizability in terms of T(θ,φ), following the similar Sm(θ,φ∗)inFκ: b e ideainprovingTheorem3. Theorem 3. Under Assumption 3, with probability 1−η, we Lθ∗(x)=1≤mi≤ax2m li∗−κ∆(x,x(i)) + , (cid:8)(cid:0) (cid:1) (cid:9) (10) have Lbθ∗(x)= min li∗+κ∆(x,x(i))} |Tk−T|≤ε 1≤i≤2m(cid:8) e with the parameters θ∗ = [l∗,··· ,l∗ ] ∈ Rn+m. They are whenthenumberofsamples 1 2m supportedintheconvexhullof{x(1),··· ,x(2m)},andwehave C′MB2κ2ρ2log(κ ρ M/ηε) k ≥ z ε2 L G , Lθ∗(x(i))=Lθ∗(x(i))=li∗ whereC′ isasufficientlylargeconstant,andM isthenumberof b e for i = 1,··· ,2m, i.e., their values coincide on parametersinthegeneratorfunction. {x(1),x(2),··· ,x(2m)}. Both theorems show the sample complexity to reach Theproofofthistheoremisgivenintheappendix. a certain level of generalization ability. For example, the Fromthetheorem,itisnothardtoshowthatanyconvex required number of samples m to generalize loss function combination of these two forms attains the same value of is proportional to NlogN, as well as the square of the Lipschitz constant κ. This implies we should control the Sn,m, and is also a global minimizer. Thus, we have the followingcorollary. samplecomplexityof trainingloss andgenerator functions by limiting not only their parametric sizes but also their Corollary1. Allthefunctionsin Lipschitz constants; the latter appearsto be more severe in causingoverfittingproblem. Lθ∗ ={γLθ∗ +(1−γ)Lθ∗|0≤γ ≤1}⊂Fκ b e aretheglobalminimizerofSm inFκ. 6 NON-PARAMETRIC ANALYSIS This shows that the global minimizer is not unique. Nowletuscharacterizetheoptimallossfunctionsbasedon Moreover, through the proof of Theorem 4, one can find the objective (8), which will provide us an insight into the that Lθ∗(x) and Lθ∗(x) are the upper and lower bound of LS-GANalgorithm. any optimal loss function solution to the problem (8). In e b We generalize the non-parametric maximum likelihood particular,wehavethefollowingcorollary. methodin[26]andconsidernon-parametricsolutionstothe optimallossfunctionbyminimizing(8)overthewholeclass Corollary 2. For any Lθ∗(x) ∈ Fκ that minimizes Sm, the ofLipschitzlossfunctions. corresponding Lθ∗(x) and Lθ∗(x) are the lower and upper Let x(1) = x1,x(2) = x2,··· ,x(m) = xm,x(m+1) = boundsofLθ∗(xb),i.e., e Gφ∗(z1),··· ,x(2m) = Gφ∗(zm), i.e., the first n data points are real examples and the rest m are generated samples. Lθ∗(x)≤Lθ∗(x)≤Lθ∗(x) Thenwehavethefollowingtheorem. b e TheproofisgiveninAppendixB. 
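To make the two bounds in (10) concrete, here is a small NumPy sketch that evaluates them at a query point, assuming the 2m anchor points x(i) and their optimal loss values li∗ are given and using the per-pixel ℓ1 distance as ∆; the function and array names are hypothetical.

```python
import numpy as np

def bound_losses(x, anchors, l_star, kappa):
    """Evaluate the lower bound L_hat(x) and upper bound L_tilde(x) from (10).

    x       : a single query sample, shape (d,)
    anchors : the 2m points x^(1), ..., x^(2m), shape (2m, d)
    l_star  : their optimal loss values l_1*, ..., l_2m*, shape (2m,)
    kappa   : the Lipschitz constant of the loss class F_kappa
    """
    delta = np.abs(anchors - x).mean(axis=1)                 # Δ(x, x^(i)) as a per-pixel L1 distance
    lower = np.maximum(l_star - kappa * delta, 0.0).max()    # max_i ( l_i* - κ Δ(x, x^(i)) )_+
    upper = (l_star + kappa * delta).min()                   # min_i ( l_i* + κ Δ(x, x^(i)) )
    return lower, upper
```

Both bounds reproduce li∗ exactly at the anchor points and change linearly with slope ±κ in between, which is why neither can saturate in the way a logistic discriminator output does.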
PREPRINT 7 The parametersθ∗ = [l∗,··· ,l∗ ] in (10) canbe sought 7 COMPARISON WITH WASSERSTEIN GAN 1 2m byminimizing We notice that the recently proposed Wasserstein GAN 1 m λ m (WGAN) [14] uses the Earth-Mover (EM) distance to ad- Sm(φ∗,θ), m li+ m ∆i,m+i+li−lm+i + dressthevanishinggradientandsaturatedJSdistanceprob- Xi=1 Xi=1(cid:0) (cid:1) lemsintheclassicGANbyshowingtheEMdistanceiscon- s.t.,|li−li′|≤κ∆(x(i),x(i′)) tinuousanddifferentiablealmosteverywhere.WhiletheLS- li ≥0, i,i′ =1,··· ,2m (11) GAN and the WGAN address these problems by different approachesthatareindependentlydeveloped,bothturnout where∆i,j isshortfor∆(x(i),x(j)),andtheconstraintsare to use the Lipschitz constraint to learn the loss function of imposed to ensure the learned loss functions stay in Fκ. the LS-GAN and the critic of the WGAN respectively. This Withagreatervalueofκ,alargerclassoflossfunctionwill constraint plays vitalbut different roles in the two models. besought.Thus,onecancontrolthemodelingabilityofthe IntheLS-GAN,theLipschitzconstraintonthelossfunction lossfunctionbysettingapropervaluetoκ. naturally arises from the Lipschitz regularity imposed on Problem (11) is a typical linear programming problem. the data density. Under this regularity condition, we have In principle, one can solve this problem to obtain a non- provedinTheorem1thatthedensityofgeneratedsamples parametric loss function for the LS-GAN. Unfortunately, it is Lipschitze and consistent with the underlying data den- consists of a large number of constraints, whose scale is sity. On the contrary, the WGAN introduces the Lipschitz 2m constraint from the Kantorovich-Rubinstein duality of the at an order of . This prevents us from using (11) (cid:18) 2 (cid:19) EM distance but it is unclear in [14] if the WGAN is also directly to solve an optimal LS-GAN model with a very based on the same Lipschitz regularity on the underlying largenumberoftrainingexamples. datadensity. However, a more tractable solution is to use a parame- HereweassertthattheWGANalsomodelsanunderly- terized network to solve this non-parametric optimization ingLipschitzdensity.Toprovethis,werestatetheWGANas problem (8) constrained in Lκ, and this is exactly the gra- follows.TheWGANseekstofindacriticf∗ andagenerator w dient descent method adopted in Algorithm 1 that itera- g∗ suchthat φ tivelyupdatesparameterizedLθ andGφ.Toensuretheloss functiontohaveaboundedLipschitzconstant,onecanuse fw∗ =arg max U(fw,gφ∗) weight decay to avoid too large value of network weights, fw∈F1 (12) ,E [f (x)]−E [f (g∗(z))] or directly clamp the weights to a bounded area. In this x∼Pdata w z∼Pz(z) w φ paper, we adopt weight decay and find it works well with and theLS-GANmodelinexperiments. Althoughthe non-parametric solution cannot be solved gφ∗ =argmaxV(fw∗,gφ),Ez∼Pz(z)[fw∗(gφ(z))] (13) dleiarerncteldy, bityisa vdaeleupabnleetwinorckh,arwachtiecrhizicnagntshheedlosssomfuenclitgiohnt Let Pgφ∗ be the density of samples generated by gφ∗. Then, we prove the following lemma about the WGAN in on how the LS-GAN is trained. It is well known that the AppendixC. trainingoftheclassicGANgeneratorsuffersfromvanishing gradientproblemasthediscriminatorcanbeoptimizedvery Lemma 3. Under Assumption 1, given an optimal solution quickly. 
Recent study [14] has revealed that this is caused (fw∗,gφ∗)totheWGANsuchthatPgφ∗ isLipschitz,wehave by using the Jensen-Shannon (JS) distance that becomes locally saturated and gets vanishing gradient to train the |Pdata(x)−Pg∗(x)|dx=0 Z φ GAN generatorif the discriminator is over-trained. Similar x problem has also been found in the energy-based GAN ThislemmashowsboththeLS-GANandtheWGANare (EBGAN) [11] as it minimizes the total variation that is basedonthesameLipschitzregularitycondition. not continuous or (sub-)differentiable if the corresponding Althoughbothmethodsare derivedfrom verydifferent discriminatorisfullyoptimized[14]. perspectives,itisinterestingtomakeacomparisonbetween On the contrary, as revealed in Theorem 4 and illus- their respective forms. Formally, the WGAN seeks to max- tratedinFigure 2, boththe upperand lowerbounds of the imizethe difference betweenthe first-ordermomentsof fw optimal LS-GAN loss functions are cone-shaped (in terms under the densities of realand generated examples. In this of ∆(x,x(i)) that defines the Lipschitz continuity), and sense, the WGAN can be considered as a kind of first-order havenon-vanishinggradientalmosteverywhere.Moreover, momentmethod. Numerically, asshownin the secondterm Problem (11) only contains linear objective and constraints; ofEq.(12),fw tendstobeminimizedtobearbitrarilysmall thisiscontrarytotheclassicGANthatinvolveslogisticloss over generated samples, which could make U(fw,gφ∗) be terms that are prone to saturationwithvanishinggradient. unboundedabove.ThisiswhytheWGANmustbetrained Thus, an optimal loss function that is properly sought in byclippingthenetworkweightsoffw onaboundedboxto Lκ as shown in Figure 2 is unlikely to saturate between preventU(fw,gφ∗)frombecomingunboundedabove. thesetwobounds,anditshouldbeabletoprovidesufficient On the contrary, the LS-GAN treats real and generated gradienttoupdatethegeneratorbydescending(9)evenifit examples in pairs, and maximizes the difference of their hasbeentrainedtilloptimality.Ourexperimentalsoshows lossesuptoadata-dependantmargin.Specifically,asshown that,evenifthelossfunctionisquicklytrainedtooptimality, in the second term of Eq. (6), when the loss of a generated itcanstillprovidesufficientgradienttocontinuouslyupdate sample zG becomes too large wrt that of a paired real thegeneratorintheLS-GAN(seeFigure5). example x, the maximization of Lθ(zG) will stop if the PREPRINT 8 Algorithm1LearningalgorithmforLS-GAN. functionLθ(x,y)tomeasurethedegreeofthemisalignment Input:mdataexamplesXm,andλ. betweenadatasamplexandagivenconditiony. 
foranumberofiterationsdo Forarealexamplexalignedwiththeconditiony,itsloss foranumberofstepsdo functionshouldbe smallerthanthatofageneratedsample \\Updatethelossfunctionoverminibatches; byamarginof∆(x,Gφ(z,y)).Thisresultsinthefollowing SampleaminibatchfromXm; constraint, SampleaminibatchfromZm; Lθ(x,y)≤Lθ(Gφ(z,y),y)−∆(x,Gφ(z,y)) (14) UpdatethelossfunctionLθ bydescendingthegradi- entof(8)withweightdecayovertheminibatches; Like the LS-GAN, this type of constraint yields the endfor followingnon-zero-sumgametotraintheCLS-GAN,which SampleasetofZ′ ofkrandomnoises; seeksaNashequilibrium(θ∗,φ∗)sothatθ∗ minimizes k UpdatethegeneratorGφ bydescendingthegradientof S(θ,φ∗)= E Lθ(x,y) (15) (9)withweightdecay; x,y∼Pdata(x,y) Update the generated samples x(m+j) = Gφ∗(zj) for +λ E ∆(x,Gφ∗(z,y))+Lθ(x,y) j =1,··· ,monZm; x,y∼Pdata(x,y)(cid:0) endfor z∼Pz(z) −Lθ(Gφ∗(z,y),y) + (cid:1) andφ∗ minimizes differenceLθ(zG)−Lθ(x)exceeds∆(x,zG).Thisprevents the minimizationproblem (6) unbounded below, makingit T(θ∗,φ)= E Lθ∗(Gφ(z,y),y) (16) betterposedtosolve. y∼Pdata(y) z∼Pz(z) Moreimportantly,paringrealandgeneratedsamplesin (·) preventstheirlossesfrom beingdecomposed intotwo Playing the above game will lead to a trained pair of + separate first-order moments like in the WGAN. Instead, lossfunctionLθ∗ andgeneratorGφ∗.Wecanshowthatthe theLS-GANmakespairwisecomparisonbetweenthelosses learned generator Gφ∗(z,y) can produce samples whose of real and generated samples, thereby enforcing real and distribution follows the true data density Pdata(x|y) for a givenconditiony. generated samples to coordinate with each other to learn the optimal loss function. Specifically, when a generated Toprovethis,wesayalossfunctionLθ(x,y)isLipschitz if it is Lipschitz continuous in its first argument x. We also sample becomes close to a paired real example, the LS- GAN will stop increasing the difference Lθ(zG) − Lθ(x) imposethefollowingregularityconditionontheconditional between their losses. This allows the LS-GAN to focus on densityPdata(x|y). improvedpoorsamplesthatarefarapartfromthemanifold Assumption 4. Foreachy,theconditionaldensityPdata(x|y) ofrealexamples,insteadofwastingitsmodelingcapacityon isLipschitz,andissupportedinaconvexcompactsetofx. those well generated samples that are already close to the Then it is not difficult to prove the following theorem, manifold of real examples. This makes the LS-GAN more efficientininvestingitsgenerativeabilityoversamples. whichshowsthattheconditionaldensityPG∗(x|y)becomes Finally, in Appendix D, we discuss a Generalized LS- Pdata(x|y)asλ→+∞.HerePG∗(x|y)denotesthedensity GAN (GLS-GAN) model, and show that both WGAN and of samples generated by Gφ∗(z,y) with sampled random noisez. LS-GAN are simply two special cases of this GLS-GAN withasuitablecostfunctionreplacing(·)+ intheminimiza- Theorem 5. UnderAssumption4,aNashequilibrium(θ∗,φ∗) tion problem(6). Thisunifies thesetwo known regularized existssuchthat GANs,andthereshouldexistsomemoresweetspotamong (i)Lθ∗(x,y)isLipschitzcontinuousinxforeachy; these GLS-GANs. Although most of theory andalgorithms (ii)PG∗(x|y)isLipschitzcontinuous; for these GLS-GANs have already been developed for the 2 LS-GAN,weleaveacomprehensivestudyoftheGLS-GAN (iii) x|Pdata(x|y)−PG∗(x|y)|dx ≤ λ. R familyinourfuturework. In addition, similar upper and lower bounds can be derivedtocharacterizethelearnedconditionallossfunction 8 CONDITIONAL LS-GAN Lθ(x,y)followingthesameideaforLS-GAN. 
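As a concrete illustration of how the conditional objective (15) can be computed, the sketch below pairs a conditional loss network with a class-conditional generator. It is a hedged PyTorch rendering under the same illustrative assumptions as the earlier sketches: `loss_net` here maps an image to per-class activations al(x), `gen(z, y)` is a generator that consumes the noise together with a class code (e.g., a concatenated one-hot vector, as described in the experiments), and the negative log-softmax form of Lθ(x, y) is the one adopted later in this section.

```python
import torch
import torch.nn.functional as F

def conditional_loss(class_logits, y):
    # One concrete choice of L_theta(x, y): the negative log-softmax of the activation
    # for class y, where class_logits are the per-class outputs a_l(x) of the loss network.
    return F.cross_entropy(class_logits, y, reduction='none')

def cls_gan_loss_objective(loss_net, gen, x_real, y, z, lam):
    """Empirical form of the CLS-GAN objective S(theta, phi*) in (15)."""
    with torch.no_grad():
        x_gen = gen(z, y)                                             # G_{phi*}(z, y)
    delta = (x_real - x_gen).abs().flatten(start_dim=1).mean(dim=1)   # Δ(x, G_{phi*}(z, y))
    l_real = conditional_loss(loss_net(x_real), y)                    # L_theta(x, y)
    l_gen = conditional_loss(loss_net(x_gen), y)                      # L_theta(G_{phi*}(z, y), y)
    hinge = torch.relu(delta + l_real - l_gen)                        # relaxed margin constraint (14)
    return l_real.mean() + lam * hinge.mean()

def cls_gan_generator_objective(loss_net, gen, y, z):
    """Empirical form of T(theta*, phi) in (16): conditioned samples should incur a small loss."""
    return conditional_loss(loss_net(gen(z, y)), y).mean()
```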
A useful byproduct of the CLS-GAN is one can use the TheLS-GANcaneasilybegeneralizedtoproduceasample learned loss function Lθ∗(x,y) to predict the label of an based on a given condition y, yielding a new paradigm of examplexby ConditionalLS-GAN(CLS-GAN). y∗ =argminLθ∗(x,y) (17) Forexample,iftheconditionisanimageclass,theCLS- y GANseekstoproduceimagesofthegivenclass;otherwise, The advantage of such a CLS-GAN classifier is it is if a text description is given as a condition, the model trainedwithbothlabeledandgeneratedexamples,thelatter attempts to generate images aligned with the given de- of which can improve the training of the classifier by re- scription. This gives us more flexibility in controlling what vealingmorepotentialvariationswithindifferentclassesof samplestobegenerated. samples.Italsoprovidesawaytoevaluatethemodelbased Formally, the generator of CLS-GAN takes a condition onitsclassificationperformance.Thisisanobjectivemetric vector y as input along with a noise vector z to produce we can use to assess the quality of feature representations a sample Gφ(z,y). To train the model, we define a loss learnedbythemodel. PREPRINT 9 Foraclassificationtask,asuitablevalueshouldbesetto turnimprovetheguessoverthetrainingcourse.Theexper- λ. Although Theorem 5 shows PG∗ would converge to the imentsinthefollowingsectionwillshowthatthisapproach true conditional density Pdata by increasing λ, it only en- cangenerateverycompetitiveperformanceespeciallywhen suresitisagoodgenerativeratherthanclassificationmodel. thelabeleddataisverylimited. However, a too large value of λ tends to ignore the first loss minimizationterm of (15) that plays animportant role 9 EXPERIMENTS in minimizing classification error. Thus, a trade-off should be made to balance between classification and generation Objective evaluation of a data generative model is not an objectives. easy task as there is no consensus criteria to quantify the quality of generated samples. For this reason, we make a qualitative analysis of generated images, and use image 8.1 Semi-SupervisedLS-GAN classification as a surrogate to quantitatively evaluate the The above CLS-GAN can be considered as a fully super- resultantLS-GANmodel. vised model to classify examples into different classes. It First, we will assess the generated images by the un- canalsobeextendedtoaSemi-Supervisedmodelbyincor- conditional LS-GAN, and compare it with the other state- poratingunlabeledexamples. of-the-art GAN model. Then, we will make an objective Suppose we have c classes indexed by {1,2,··· ,c}. In evaluation by using the learned loss function in the CLS- theCLS-GAN,foreachclass,wechoosealossfunctionthat, GAN to classify images. This task evaluates the quality of forexample,canbedefinedasthenegativelog-softmax, the feature representations extracted by the CLS-GAN in exp(a (x)) termsofitsclassificationaccuracydirectly.Wewillcompare L (x,y=l)=−log l θ c exp(a (x)) it with the feature representations extracted by the other l=1 l deepgenerativenetworks. P where al(x) is the lth activation output from a network We will also conduct a qualitative evaluation of the layer. generated images by the CLS-GAN for different classes, Suppose we also have unlabeled examples available, and analyze the factors that would affect image genera- and we can define a new loss function for these unlabeled tion performance. It will give us an intuitive idea of why examplesso that theycan be involved in training the CLS- and how the CLS-GAN can capture class-invariant feature GAN. 
Consider an unlabeled example x, its groundtruth representations to classify and generate images for various labelisunknown.However,thebestguessofitslabelcanbe classes. madebychoosingtheonethatminimizesLθ(x,y=l)over l, and this inspires us to define the following loss function 9.1 Architectures fortheunlabeledexampleas Thedetailofthenetworkarchitectureweadoptfortraining Lul(x),minL (x,y=l) θ l θ CLS-GANonthesetwodatasetsispresentedinTable1. Specifically, we adopt the ideas behind the network Here we modify Lθ(x,y = l) to −log1+Peclx=p1(aexl(px()a)l(x)) so architecture for the DCGAN [7] to build the generator and 1+Pcl=1e1xp(al(x)) can be viewed as the probability that x thelossfunctionnetworks.Comparedwiththeconventional doesnotbelongtoanyknownlabel. CNNs,maxpoolinglayers arereplacedwithstridedconvo- Thenwehavethefollowing loss-sensitiveobjectivethat lutions in both networks, and fractionally-strided convolu- exploresunlabeledexamplestotraintheCLS-GAN, tionsareusedinthegeneratornetworktoupsamplefeature mapsacrosslayerstofinerresolutions.Batch-normalization Sul(θ,φ∗), (18) layers are added in both networks between convolutional E ∆(x,Gφ∗(z))+Luθl(x)−Luθl(Gφ∗(z)) + layers, and fully connected layers are removed from these x∼Pdata(x)(cid:0) (cid:1) networks. z∼Pz(z) However, unlike the DCGAN, the LS-GAN model (un- Thisobjectiveis combined withS(θ,φ∗)defined in(15) conditional version in Section 3) does not use a sigmoid totrainthelossfunctionnetworkbyminimizing layer as the output for the loss function network. Instead, S(θ,φ∗)+γSul(θ,φ∗) we remove it and directly output the activation before the removedsigmoidlayer.Thisis becausefor a unconditional where γ is a positive hyperparameterbalancing the contri- LS-GAN, wecandisregardthe firstlossminimizationterm butionsfromlabeledandlabeledexamples. (seetheremarkafterLemma1).Inthiscase,anyformofloss The idea of extending the GAN for semi-supervised functioncanbeadoptedwithoutnonnegativeconstraint. learninghasbeenproposedbyOdena [27]andSalimanset On the other hand, for the loss function network in al.[2],wheregeneratedsamplesareassignedtoanartificial CLS-GAN, a global mean-pooling layer is added on top of class, and unlabeled examples are treated as the negative convolutionallayers.Thisproducesa1×1featuremapthat examples. Our proposed semi-supervised learning differs is fed into a cross-entropy cost function to output the loss increatinganewlossfunctionforunlabeledexamplesfrom Lθ(x,y)conditionedonagivenclassy. the losses for existing classes, by minimizing which we In the generator network, Tanh is used to produce im- make the best guess of the classes of unlabeled examples. ageswhosepixelvaluesarescaledto[−1,1].Thus,allimage Theguessedlabeledwillprovideadditionalinformation to examples in datasets are preprocessed to have their pixel train the CLS-GAN model, and the updated model will in valuesin [−1,1].More details about the design of network PREPRINT 10 TABLE1 random vector as its input. We will train CLS-GAN as TheNetworkarchitectureusedinLS-GANfortrainingCIFAR-10and presented in Section 8 by involving both unlabeled and SVHN,whereBNstandsforbatchnormalization,LeakyReLUforLeaky labeled examples. This will be compared against the other Rectifierwithaslopeof0.2fornegativevalue,and“3c1s96oConv.” meansa3×3convolutionkernelwithstride1and96outputs,while state-of-the-art supervised deep generative models as well ”UpConv.”denotesthefractionally-strideconvolution. astheotherGANmodelsinliterature. 
(a) LossFunctionNetwork 9.3 GeneratedImagesbyLS-GAN Input32×32×3 Firstwemakeaqualitativecomparisonbetweentheimages generated by the DCGAN and the LS-GAN on the celebA 3c1s96oConv.BNLeakyReLU 3c1s96oConv.BNLeakyReLU dataset. 4c2s96oConv.BNLeakyReLU As illustrated in Figure 3, there is no perceptible differ- 3c1s192oConv.BNLeakyReLU encebetweenthequalitiesofgeneratedimagesbytwocom- 3c1s192oConv.BNLeakyReLU paredGANmodelsaftertheyaretrainedafter25epochs. 4c2s192oConv.BNLeakyReLU The DCGAN architecture has been exhaustively fine- 3c1s192oConv.BNLeakyReLU tunedinthecontextoftheclassicGANtrainingcriterionto 3c1s192oConv.BNLeakyReLU maximizethegenerationperformance. Itissusceptiblethat 1c1s192oConv.BNLeakyReLU its architecturecouldbe fragile ifwe makesome change to globalmeanpool it.Herewe testifthe LS-GANcanbemore robustthanthe Output32×32×10 DCGANwhenastructurechangeismade. For example, one of the most key components in the (b) GeneratorNetwork DCGAN is the batch normalization inserted between the Input100-Drandomvector+10-Done-hotvector fractionalconvolutionlayersinthegeneratornetwork.Ithas been reported in literature [2] that the batch normalization 4c1s512oUpConv.BNLeakyReLU 4c2s256oUpConv.BNLeakyReLU notonlyplaysakeyroleintrainingtheDCGANmodel,but 4c2s128oUpConv.BNLeakyReLU also prevents the mode collapse of the generator into few 4c2s3oUpConv.BNLeakyReLU datapoints. ElementwiseTanh TheresultsareillustratedinFigure4.Ifoneremovedthe Output32×32×3 batchnormalizationlayersfromthegenerator,theDCGAN would collapse without producing any face images. On the contrary, the LS-GAN still performs very well even if these batch normalization layers are removed, and there architecturescanbefoundinliterature[7].Table1showsthe is no perceived deterioration or mode collapse of the networkarchitecturefortheCLS-GANmodelonCIFAR-10 generated images. This shows that the LS-GAN is more and SVHN datasets in the experiments. In particular, the resilientagainstthearchitecturechangesthantheDCGAN. architecture of the loss function network is adapted from We also analyze the magnitude (ℓ norm) of the gener- 2 thatusedin[22]withninehiddenlayers. ator’s gradient (in logarithmic scale)in Figure 5 over itera- tions. Withthe loss functionbeingupdatedeveryiteration, 9.2 TrainingDetails the generator is only updated every 1, 3, and 5 iterations. In this way, we wish to study if the gradient to update The models are trained in a mini-batch of 64 images, and the generator would lessen or vanish if the loss function their weights are initialized from a zero-mean Gaussian isovertrained. distribution with a standard deviation of 0.02. The Adam From the figure, we note that the magnitude of the optimizer [28] is used to train the network with initial generator’s gradient, no matter how frequently the loss learningrate andβ beingset to10−3 and0.5respectively, 1 functionisupdated,graduallyincreasesuntilitstopsatthe while the learning rate is annealed every 25 epochs by a samelevel.Thisimpliestheobjectivefunctiontoupdatethe factorof0.8. generator tends to be linear rather than saturated through The hyperparameter γ and λ are chosen via a five-fold the training process, which is consistent with our analysis cross-validation. We also test different types of distance ofthenon-parametricsolutiontothe optimallossfunction. metricsforthelossmargin∆(·,·),andfindtheL distance 1 Thus,itprovidessufficientgradienttocontinuouslyupdate achieves better performance among the other compared thegenerator. choices like L distance. 
We also tried to use intermediate feature maps from the loss function network to compute the loss margin between images. Unfortunately, we found the results were not as good as with the L1 distance: the loss margin ∆(·,·) tends to reduce to zero over epochs as these intermediate feature maps would collapse to a trivial single point.

For the generator network of the LS-GAN, it takes a 100-dimensional random vector drawn from Unif[−1,1] as input. For the CLS-GAN generator, a one-hot vector encoding the image class is concatenated with the sampled

9.4 Image Classification

We conduct experiments on CIFAR-10 and SVHN to compare the classification accuracy of the LS-GAN with the other approaches.

9.4.1 CIFAR-10

The CIFAR dataset [33] consists of 50,000 training images and 10,000 test images on ten image categories. We test the proposed CLS-GAN model with class labels as conditions.
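For the classification experiments, the prediction rule is (17): assign the class whose conditional loss is smallest. A minimal sketch, under the same assumption that the loss network outputs per-class activations al(x):

```python
import torch

def classify(loss_net, x):
    # y* = argmin_y L_theta*(x, y). With the negative log-softmax loss, the smallest
    # conditional loss corresponds to the largest class activation a_l(x), so the
    # rule reduces to an argmax over the loss network's class outputs.
    class_logits = loss_net(x)              # shape: (batch, num_classes)
    return class_logits.argmax(dim=1)
```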