Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis DmitryUlyanov AndreaVedaldi SkolkovoInstituteofScienceandTechnology&Yandex UniversityofOxford [email protected] [email protected] VictorLempitsky 7 SkolkovoInstituteofScienceandTechnology 1 0 [email protected] 2 n a Abstract J 9 The recent work of Gatys et al., who characterized the ] style of an image by the statistics of convolutional neural V networkfilters,ignitedarenewedinterestinthetexturegen- C erationandimagestylizationproblems. Whiletheirimage . s generationtechniqueusesaslowoptimizationprocess, re- (a)Panda. (b) (c) c cently several authors have proposed to learn generator [ neural networks that can produce similar outputs in one 1 quickforwardpass. Whilegeneratornetworksarepromis- v ing, they are still inferior in visual quality and diversity 6 compared to generation-by-optimization. In this work, we 9 advance them in two significant ways. First, we introduce 0 2 aninstancenormalizationmoduletoreplacebatchnormal- (d)Style. (e) (f) 0 izationwithsignificantimprovementstothequalityofimage 1. stylization. Second, we improve diversity by introducing Figure 1: Which panda stylization seems the best to you? 0 a new learning formulation that encourages generators to Definitely not the variant (b), which has been produced 7 sampleunbiasedlyfromtheJulesztextureensemble,which byastate-of-the-artalgorithmamongmethodsthattakeno 1 istheequivalenceclassofallimagescharacterizedbycer- : longerthanasecond. The(e)picturetookseveralminutes v tainfilterresponses. Together,thesetwoimprovementstake to generate using an optimization process, but the quality i feed forward texture synthesis and image stylization much X isworthit,isn’tit? Wewouldbeparticularlyhappyifyou closer to the quality of generation-via-optimization, while r choseonefromtherightmosttwoexamples,whicharecom- a retainingthespeedadvantage. putedwithournewmethodthataspirestocombinethequal- ity of the optimization-based method and the speed of the fast one. Moreover, our method is able to produce diverse 1.Introduction stylizationsusingasinglenetwork. TherecentworkofGatysetal.[4,5], whichuseddeep linearconvolutionalneuralnetworkfiltersforthispurpose. neuralnetworksfortexturesynthesisandimagestylization Despite the excellent results, however, the matching pro- toagreateffect,hascreatedasurgeofinterestinthisarea. cess is based on local optimization, and generally requires Following an eariler work by Portilla and Simoncelli [15], aconsiderableamountoftime(tensofsecondstominutes) they generate an image by matching the second order mo- inordertogenerateasingletexturesorstylizedimage. mentsoftheresponseofcertainfiltersappliedtoareference Inordertoaddressthisshortcoming,Ulyanovetal.[19] textureimage. TheinnovationofGatysetal.istousenon- and Johnson et al. [8] suggested to replace the optimiza- tion process with feed-forward generative convolutional The source code is available at https://github.com/ DmitryUlyanov/texture_nets networks. In particular, [19] introduced texture networks 1 togeneratetexturesofacertainkind,asin[4],ortoapply was first studied by Julesz [9], who suggested that the vi- acertaintexturestyletoanarbitraryimage,asin[5]. Once sual system discriminates between different textures based trained, such texture networks operate in a feed-forward ontheaverageresponsesofcertainimagefilters. manner,threeordersofmagnitudefasterthantheoptimiza- The work of [20] formalized Julesz’ ideas by introduc- tionmethodsof[4,5]. ing the concept of Julesz ensemble. There, an image is The price to pay for such speed is a reduced perfor- a real function x : Ω R3 defined on a discrete lat- → mance. For texture synthesis, the neural network of [19] tice Ω = 1,...,H 1,...,W and a texture is a { } × { } generates good-quality samples, but these are not as di- distribution p(x) over such images. The local statistics verse as the ones obtained from the iterative optimization of an image are captured by a bank of (non-linear) filters method of [4]. For image stylization, the feed-forward re- F : Ω R,l=1,...,L,whereF (x,u)denotesthe l l X × → sults of [19, 8] are qualitatively and quantitatively worse responseoffilterF atlocationuonimagex. Theimagex l than iterative optimization. In this work, we address both ischaracterizedbythespatialaverageofthefilterresponses (cid:80) limitations by means of two contributions, both of which µ (x) = F (x,u)/Ω. Theimageisperceivedasa l u∈Ω l | | extendbeyondtheapplicationsconsideredinthispaper. particular texture if these responses match certain charac- Our first contribution (section 4) is an architectural teristicvaluesµ¯ . Formally,giventhelossfunction, l change that significantly improves the generator net- L works. The change is the introduction of an instance- (cid:88) (x)= (µ (x) µ¯ )2 (1) normalizationlayerwhich,particularlyforthestylization L l − l l=1 problem,greatlyimprovestheperformanceofthedeepnet- workgenerators.Thisadvancesignificantlyreducesthegap theJuleszensembleisthesetofalltextureimages instylisationqualitybetweenthefeed-forwardmodelsand the original iterative optimization method of Gatys et al., (cid:15) = x : (x) (cid:15) T { ∈X L ≤ } bothquantitativelyandqualitatively. that approximately satisfy such constraints. Since all tex- Our second contribution (section 3) addresses the lim- turesintheJuleszensembleareperceptuallyequivalent,itis iteddiversityofthesamplesgeneratedbytexturenetworks. naturaltorequirethetexturedistributionp(x)tobeuniform In order to do so, we introduce a new formulation that overthisset. Inpractice, itismoreconvenienttoconsider learns generators that uniformly sample the Julesz en- theexponentialdistribution semble [20]. The latter is the equivalence class of im- agesthatmatchcertainfilterstatistics. Uniformlysampling e−L(x)/T this set guarantees diverse results, but traditionally doing p(x)= (cid:82) , (2) e−L(y)/T dy so required slow Monte Carlo methods [20]; Portilla and Simoncelli, and hence Gatys et al., cannot sample from whereT >0isatemperatureparameter.Thischoiceismo- this set, but only find individual points in it, and possibly tivated as follows [20]: since statistics are computed from just one point. Our formulation minimizes the Kullback- spatial averages of filter responses, one can show that, in Leibler divergence between the generated distribution and the limit of infinitely large lattices, the distribution p(x) is a quasi-uniform distribution on the Julesz ensemble. The zero outside the Julesz ensemble and uniform inside. In learning objective decomposes into a loss term similar to this manner, eq. (2) can be though as a uniform distribu- Gatysetal.minustheentropyofthegeneratedtexturesam- tion over images that have a certain characteristic filter re- ples, whichweestimateinadifferentiablemannerusinga sponsesµ¯ =(µ¯ ,...,µ¯ ). 1 L non-parametricestimator[12]. Notealsothatthetextureiscompletelydescribedbythe We validate our contributions by means of extensive filter bank F = (F ,...,F ) and their characteristic re- 1 L quantitative and qualitative experiments, including com- sponsesµ¯. Asdiscussedbelow,thefilterbankisgenerally paring the feed-forward results with the gold-standard fixed, so in this framework different textures are given by optimization-based ones (section 5). We show that, com- differentcharacteristicsµ¯. bined,theseideasdramaticallyimprovethequalityoffeed- forward texture synthesis and image stylization, bringing Generation-by-minimization. Foranyinterestingchoice them to a level comparable to the optimization-based ap- ofthefilterbankF,samplingfromeq.(2)isratherchalleng- proaches. ingandclassicallyaddressedbyMonteCarlomethods[20]. In order to make this framework more practical, Portilla 2.Backgroundandrelatedwork andSimoncelli[16]proposedinsteadtoheuristicallysam- Julesz ensemble. Informally, a texture is a family of vi- plefromtheJuleszensemblebytheoptimizationprocess sual patterns, such as checkerboards or slabs of concrete, x∗ =argmin (x). (3) thatsharecertainlocalstatisticalregularities. Theconcept L x∈X If this optimization problem can be solved, the minimizer While this approach works well in practice, it shares the x∗ is by definition a texture image. However, there is no same important limitation as the original work of Portilla reasonwhythisprocessshouldgeneratefairsamplesfrom and Simoncelli: there is no guarantee that samples gener- the distribution p(x). In fact, the only reason why eq. (3) ated by g∗ would be fair samples of the texture distribu- maynotsimplyreturnalwaysthesameimageisthattheop- tion(2). Inpractice,asweshowinthepaper,suchsamples timization algorithm is randomly initialized, the loss func- tendinfacttobenotdiverseenough. tionishighlynon-convex,andsearchislocal.Onlybecause Both[8,19]havealsoshownthatsimilargeneratornet- ofthiseq.(3)maylandondifferentsamplesx∗ondifferent worksworkalsoforstylization. Inthiscase, thegenerator runs. g(x ,z) is a function of the content image x and of the 0 0 randomnoisez. Thenetworkg islearnedtominimizethe sumoftexturelossandthecontentloss: Deep filter banks. Constructing a Julesz ensemble re- quireschoosingafilterbankF.Originally,researcherscon- g∗ =argmin E [ (g(x ,z))+α (g(x ,z),x )]. sideredtheobviouscandidates: Gaussianderivativefilters, g px0,pz L 0 Lcont. 0 0 Gaborfilters,wavelets,histograms,andsimilar[20,16,21]. (5) Morerecently,theworkofGatysetal.[4,5]demonstrated thatmuchsuperiorfiltersareautomaticallylearnedbydeep convolutional neural networks (CNNs) even when trained Alternativeneuralgeneratormethods. Therearemany for apparently unrelated problems, such as image classifi- other techniques for image generation using deep neural cation. Inthispaper, inparticular, wechoosefor (x)the networks. L stylelossproposedby[4].Thelatteristhedistancebetween TheJuleszdistributioniscloselyrelatedtotheFRAME theempiricalcorrelationmatricesofdeepfilterresponsesin maximumentropymodelof[21],aswellastotheconcept aCNN.1 ofMaximumMeanDiscrepancy(MMD)introducedin[7]. BothFRAMEandMMDmaketheobservationthataprob- ability distribution p(x) can be described by the expected Stylization. The texture generation method of Gatys et values µ = E [φ (x)] of a sufficiently rich set of α x∼p(x) α al.[4]canbeconsideredasadirectextensionofthetexture statisticsφ (x). Buildingontheseideas, [14,3]construct α generation-by-minimization technique (3) of Portilla and generatorneuralnetworksgwiththegoalofminimizingthe Simoncelli [16]. Later, Gatys et al. [5] demonstrated that discrepancybetweenthestatisticsaveragedoverabatchof the same technique can be used to generate an image that generatedimages(cid:80)N φ (g(z ))/N andthestatisticsav- mixesthestatisticsoftwootherimages,oneusedasatex- eraged over a traningi=s1etα(cid:80)M iφ (x )/M. The resulting ture template and one used as a content template. Content i=1 α i networksgarecalledMomentMatchingNetworks(MMN). is captured by introducing a second loss (x,x ) that Lcont. 0 An important alternative methodology is based on the comparestheresponsesofdeepCNNfiltersextractedfrom concept of Generative Adversarial Networks (GAN; [6]). thegeneratedimagexandacontentimagex . Minimizing 0 This approach trains, together with the generator network thecombinedloss (x)+α (x,x )yieldsimpressive L Lcont. 0 g(z),asecondadversarialnetworkf()thatattemptstodis- artisticimages,whereatextureµ¯,definingtheartisticstyle, · tinguishbetweengeneratedsamplesg(z),z (0,I)and isfusedwiththecontentimagex0. realsamplesx p (x). Theadversarialm∼odNelf canbe data ∼ used as a measure of quality of the generated samples and Feed-forwardgeneratornetworks. Forallitssimplicity usedtolearnabettergeneratorg.GANarepowerfulbutno- and efficiency compared to Markov sampling techniques, toriouslydifficulttotrain. Alotofresearchishasrecently generation-by-optimization (3) is still relatively slow, and focused on improving GAN or extending it. For instance, certainlytooslowforreal-timeapplications. Therefore, in LAPGAN[2]combinesGANwithaLaplacianpyramidand thepastfewmonthsseveralauthors[8,19]haveproposedto DCGAN[17]optimizesGANforlargedatasets. learngeneratorneuralnetworksg(z)thatcandirectlymap random noise samples z pz = (0,I) to a local mini- 3.Juleszgeneratornetworks ∼ N mizerofeq.(3). Learningtheneuralnetworkgamountsto minimizingtheobjective This section describes our first contribution, namely a methodtolearnnetworksthatdrawsamplesfromtheJulesz g∗ =argminE (g(z)). (4) ensemble modelling a texture (section 2), which is an in- g pzL tractable problem usually addressed by slow Monte Carlo methods[21,20]. Generation-by-optimization,popularized 1Note that such matrices are obtained by averaging local non-linear by Portilla and Simoncelli and Gatys et al., is faster, but filters:thesearetheouterproductsoffiltersinacertainlayerofthenerual network.Hence,thestylelossofGatysetal.isinthesameformaseq.(1). canonlyfindonepointintheensemble,notsamplefromit, withscarcesamplediversity,particularlywhenusedtotrain An energy term similar to (6) was recently proposed feed-forwardgeneratornetworks[8,19]. in [10] for improving the diversity of a generator network Here,weproposeanewformulationthatallowstotrain inaadversariallearningscheme. Whiletheideaissuperfi- generatornetworksthatsampletheJuleszensemble,gener- cially similar, the application (sampling the Julesz ensem- atingimageswithhighvisualfidelityaswellashighdiver- ble) and instantiation (the way the entropy term is imple- sity. mented)areverydifferent. A generator network [6] maps an i.i.d. noise vector z (0,I) to an image x = g(z) in such a way that Learning objective. We are now ready to define an ob- ∼ N x is ideally a sample from the desired distribution p(x). jectivefunctionE(g)tolearnthegeneratornetworkg.This Such generators have been adopted for texture synthesis is given by substituting the expected loss (7) and the en- in [19], but without guarantees that the learned generator tropyestimator(9),computedoverabatchofN generated g(z)wouldindeedsampleaparticulardistribution. images,intheKLdivergence(6): Here, we would like to sample from the Gibbs distri- bution (2) defined over the Julesz ensemble. This distri- N butioncanbewrittencompactlyasp(x) = Z−1e−L(x)/T, E(g)= 1 (cid:88)(cid:104)1 (g(z )) where Z = (cid:82) e−L(x)/T dx is an intractable normalization N TL i i=1 constant. (cid:105) λlnmin g(z ) g(z ) (10) Denote by q(x) the distribution induced by a generator − j(cid:54)=i (cid:107) i − j (cid:107) networkg. Thegoalistomakethetargetdistributionpand thegeneratordistributionqascloseaspossiblebyminimiz- The batch itself is obtained by drawing N samples ingtheirKullback-Leibler(KL)divergence: z1,...,zn (0,I) from the noise distribution of the ∼ N generator. The first term in eq. (10) measures how closely (cid:90) q(x)Z KL(q p)= q(x)ln dx thegeneratedimagesg(zi)aretotheJuleszensemble. The || p(x) secondtermquantifiesthelackofdiversityinthebatchby 1 mutuallycomparingthegeneratedimages. = E (x)+ E lnq(x)+ln(Z) (6) T x∼q(x)L x∼q(x) = 1 E (x) H(q)+const. Learning. The loss function (10) is in a form that al- T x∼q(x)L − lowsoptimizationbymeansofStochasticGradientDescent (SGD).Thealgorithmsamplesabatchz ,...,z atatime Hence,theKLdivergenceisthesumoftheexpectedvalue 1 n andthendescendsthegradient: ofthestyleloss andthenegativeentropyofthegenerated L distributionq. N Thefirsttermcanbeestimatedbytakingtheexpectation 1 (cid:88)(cid:104) d dg(zi) L overgeneratedsamples: N dx(cid:62) dθ(cid:62) i=1 x∼Eq(x)L(x)=z∼NE(0,I)L(g(z)). (7) − ρλi(g(zi)−g(zji∗))(cid:62)(cid:18)ddgθ(z(cid:62)i) − dgd(θz(cid:62)ji∗)(cid:19)(cid:105) (11) Thisissimilartothereparametrizationtrickof[11]andis alsousedin[8,19]toconstructtheirlearningobjectives. whereθisthevectorofparametersoftheneuralnetworkg, The second term, the negative entropy, is harder to es- thetensorimagexhasbeenimplicitlyvectorizedandji∗ is timate accurately, but simple estimators exist. One which theindexofthenearestneighbourofimageiinthebatch. isparticularlyappealinginourscenarioistheKozachenko- 4.Stylizationwithinstancenormalization Leonenkoestimator[12]. Thisestimatorconsidersabatch of N samples x ,...,x q(x). Then, for each sample 1 n ∼ The work of [19] showed that it is possible to learn x , it computes the distance ρ to its nearest neighbour in i i high-qualitytexture networksg(z) thatgenerate imagesin thebatch: a Julesz ensemble. They also showed that it is possible to ρ =min x x . (8) i j(cid:54)=i (cid:107) i− j(cid:107) learn good quality stylization networks g(x0,z) that apply thestyleofafixedtexturetoanarbitrarycontentimagex . Thedistancesρ canbeusedtoapproximatetheentropyas 0 i Nevertheless, the stylization problem was found to be follows: N harder than the texture generation one. For the stylization D (cid:88) H(q)≈ N lnρi+const. (9) task,theyfoundthatlearningthemodelfromtoomanyex- i=1 amplecontentimagesx0,saymorethan16,yieldedpoorer whereD = 3WH isthenumberofcomponentsoftheim- qualitativeresultsthanusingasmallernumberofsuchex- agesx R3×W×H. amples. Someofthemostsignificanterrorsappearedalong ∈ theborderofthegeneratedimages,probablyduetopadding and other boundary effects in the generator network. We conjecturedthatthesearesymptomsofalearningproblem toodifficultfortheirchoiceofneuralnetworkarchitecture. Asimpleobservationthatmaymakelearningsimpleris thattheresultofstylizationshouldnot, ingeneral, depend Figure2:Comparisonofnormalizationtechniquesinimage onthecontrastofthecontentimagebutrathershouldmatch stylization. From left to right: BN, cross-channel LRN at thecontrastofthetexturethatisbeingappliedtoit. Thus, thefirstlayer,INatthefirstlayer,INthroughout. the generator network should discard contrast information 16.0 16.5 inthecontentimagex0. Wearguethatlearningtodiscard 15.5 Batch 16.0 Batch contrastinformationbyusingstandardCNNbuildingblock 15.0 Instance 15.5 Instance isunnecessarilydifficult,andisbestdonebyaddingasuit- (x)L14.5 (x)L1145..50 Random ablelayertothearchitecture. ln 14.0 ln 14.0 To see why, let x RN×C×W×H be an input tensor 13.5 13.5 containingabatchofN∈images. Letx denoteitsnijk- 13.00 3000 6000 9000 12000 13.00 50 100 150 200 250 300 nijk Iteration Iteration thelement,wherek andj spanspatialdimensions,iisthe (a)Feed-forwardhistory. (b)Finetuninghistory. feature channel (i.e. the color channel if the tensor is an RGBimage), andnistheindexoftheimageinthebatch. Then,contrastnormalizationisgivenby: x µ ynijk = (cid:112)nijk− ni, σ2 +(cid:15) ni W H 1 (cid:88) (cid:88) (c)Content. (d)StyleNetIN. (e)INfinetuned. µ = x , ni HW nilm (12) l=1m=1 W H 1 (cid:88) (cid:88) σ2 = (x µ )2. ni HW nilm− ni l=1m=1 Itisunclearhowsuchasfunctioncouldbeimplementedas asequenceofstandardoperatorssuchasReLUandconvo- (f)Style. (g)StyleNetBN. (h)BNfinetuned. lution. On the other hand, the generator network of [19] does Figure3: (a)learningobjectiveasafunctionofSGDitera- containanormalizationlayers,andpreciselybatchnormal- tionsforStyleNetINandBN.(b)Directoptimizationofthe ization(BN)ones. Thekeydifferencebetweeneq.(12)and Gatysetal.forthisexampleimagestartingfromthersultof batchnormalizationisthatthelatterappliesthenormaliza- StyleNetINandBN.(d,g)ResultofStyleNetwithinstance tiontoawholebatchofimagesinsteadofsingleones: (d)andbatchnormalization(g).(e,h)Resultoffinetuingthe Gatysetal.energy. x µ ynijk = (cid:112)niσjk2−+(cid:15)i, batchasawhole. Noteinparticularthatthismeansthatin- i stancenormalizationisappliedthroughoutthearchitecture, N W H 1 (cid:88)(cid:88) (cid:88) notjustattheinputimage—fig.2showsthebenefitofdoing µ = x , i HWN nilm (13) so. n=1l=1m=1 Another similarity with BN is that each IN layer is fol- N W H σ2 = 1 (cid:88)(cid:88) (cid:88)(x µ )2. lowedbyascalingandbiasoperators x+b.Adifference i HWN nilm− i isthattheINlayerisappliedattesttim(cid:12)easwell,unchanged, n=1l=1m=1 whereas BN is usually switched to use accumulated mean We argue that, for the purpose of stylization, the normal- andvarianceinsteadofcomputingthemoverthebatch. izationoperatorofeq.(12)ispreferableasitcannormalize IN appears to be similar to the layer normalization eachindividualcontentimagex0. method introduced in [1] for recurrent networks, although Whilesomeauthorscalllayereq.(12)contrastnormal- itisnotclearhowtheyhandlespatialdata. Liketheirs,IN ization, here we refer to it as instance normalization (IN) isagenericlayer,sowetesteditinclassificationproblems since we use it as a drop-in replacement for batch nor- aswell.Insuchcases,itstillworksurprisinglywell,butnot malization operating on individual instances instead of the aswellasbatchnormalization(e.g.AlexNet[13]INhas2- Content StyleNetIN(ours) StyleNetBN Gatysetal. Style Figure4: Stylizationresultsobtainedbyapplyingdifferenttextures(rightmostcolumn)todifferentcontentimages(leftmost column). Threemethodsarecompared: StyleNetIN,StyleNetBN,anditerativeoptimization. 3% worse top-1 accuracy on ILSVRC [18] than AlexNet residualarchitecturefrom[8]forallourstyletransferexper- BN). iments. We also experimented with architecture from [19] and observed a similar improvement with our method, but 5.Experiments use the one from [8] for convenience. We call it StyleNet withapostfixBNifitisequippedwithbatchnormalization In this section, after discussing the technical details of orINforinstancenormalization. themethod,weevaluateournewtexturenetworkarchitec- turesusinginstancenormalization,andtheninvestigatethe Fortexturesynthesiswecomparetwoarchitectures: the abilityofthenewformulationtolearndiversegenerators. multiscalefully-convolutionalarchitecturefrom[19](Tex- tureNetV1) and the one we design to have a very large re- 5.1.Technicaldetails ceptive field (TextureNetV2). TextureNetV2 takes a noise Network architecture. Among two generator network vector of size 256 and first transforms it with two fully- architectures,proposedpreviouslyin [19,8],wechoosethe connected layers. The output is then reshaped to a 4 4 × Input TextureNetV2 λ=0 TextureNetV2 λ>0(ours) TextureNetV1 λ=0 Figure5: ThetexturesgeneratedbythehighcapacityTextureNetV2withoutdiversityterm(λ = 0ineq.(10))areneary identical. ThelowcapacityTextureNetV1of [19]achievesdiversity,buthassometimespoorresults. TextureNetV2with diversityisthebestofbothworlds. image and repeatedly upsampled with fractionally-strided 5.2.Effectofinstancenormalization convolutions similar to [17]. More details can be found in In order to evaluate the impact of replacing batch nor- thesupplementarymaterial. malization with instance normalization, we consider first theproblemofstylization,wherethegoalistolearnagen- erator x = g(x ,z) that applies a certain texture style to 0 Weight parameters. In practice, for the case of λ > 0, the content image x0 using noise z as “random seed”. We entropylossandtexturelossineq.(10)shouldbeweighted setλ = 0forwhichgeneratorismostlikelytodiscardthe properly. As only the value of Tλ is important for opti- noise. mization we assume λ = 1 and choose T from the set of TheStyleNetINandStyleNetBNarecomparedinfig.3. three values (5,10,20) for texture synthesis (we pick the Panel fig. 3.a shows the training objective (5) of the net- highervalueamongthosenotleadingtoartifacts–seeour works as a function of the SGD training iteration. The discussionbelow). WefixT = 10000forstyletransferex- objective function is the same, but StyleNet IN converges periments.Fortexturesynthesis,similarlyto[19],wefound much faster, suggesting that it can solve the stylization usefultonormalizegradientofthetexturelossasitpasses problem more easily. This is confirmed by the stark dif- backthroughtheVGG-19network. Thisallowsrapidcon- ferenceinthequalitativeresultsinpanels(d)end(g). Since vergenceforstochasticoptimizationbutimplicitlyaltersthe theStyleNetsaretrainedtominimizeinoneshotthesame objectivefunctionandrequirestemperaturetobeadjusted. objective as the iterative optimization of Gatys et al., they Weobservethatfortextureswithflatlightninghighentropy canbeusedtoinitializethelatteralgorithm.Panel(b)shows weightresultsinbrightnessvariationsovertheimagefig.7. theresultofapplyingtheGatysetal.optimizationstarting Wehypothesizethisissuecanbesolvedifeithermoreclever from their random initialization and the output of the two distanceforentropyestimationisusedoranimagepriorin- StyleNets. Clearly both networks start much closer to an troduced. optimum than random noise, and IN closer than BN. The Content Style StyleNet λ>0 StyleNetλ=0 Figure6: TheStyleNetV2g(x ,z), trainedwithdiversityλ > 0, generatessubstantiallydifferentstylizationsfordifferent 0 valuesoftheinputnoisez. Inthiscase,thelackofstylizationdiversityisvisibleinuniformregionssuchasthesky. 5.3.Effectofthediversityterm HavingvalidatedtheIN-basedarchitecture,weevaluate now the effect of the entropy-based diversity term in the objectivefunction(10). The experiment in fig. 5 starts by considering the problem of texture generation. We compare the new high-capacity TextureNetV2 and the low-capacity Tex- tureNetsV1 texture synthesis networks. The low-capacity model is the same as [19]. This network was used there Figure7: Negativeexamples. Ifthediversitytermλistoo in order to force the network to learn a non-trivial depen- high for the learned style, the generator tends to generate dency on the input noise, thus generating diverse outputs artifacts in which brightness is changed locally (spotting) even though the learning objective of [19], which is the insteadof(oraswellas)changingthestructure. same as eq. (10) with diversity coefficient λ = 0, tends to suppress diversity. The results in fig. 5 are indeed diverse, but sometimes of low quality. This should be contrasted differenceisqualitativelylarge: panels(e)and(h)showthe with TextureNetV2, the high-capacity model: its visual fi- changeintheStyleNetsoutputafterfinetuningbyiterative delityismuchhigher,but,byusingthesameobjectivefunc- optimizationoftheloss,whichhasasmalleffectfortheIN tion[19],thenetworklearnstogenerateasingleimage,as variant,andamuchlargeronefortheBNone. expected. TextureNetV2 with the new diversity-inducing objective (λ > 0) is the best of both worlds, being both Similar results apply in general. Other examples are high-qualityanddiverse. showninfig.4, wheretheINvariantisfarsuperiortoBN Theexperimentinfig.6assessestheeffectofthediver- andmuchclosertotheresultsobtainedbythemuchslower sityterminthestylizationproblem. Theresultsaresimilar iterativemethodofGatysetal.StyleNetsaretrainedonim- totheonesfortexturesynthesisandthediversitytermeffec- agesofafixedsized,butsincetheyareconvolutional,they tivelyencouragesthenetworktolearntoproducedifferent canbeappliedtoarbitrarysizes. Inthefigure, thetoptree resultsbasedontheinputnoise. images are processed at 512 512 resolution and the bot- × One difficultly with texture and stylization networks is tomtwoat1024 1024. Ingeneral, wefoundthathigher × that the entropy loss weight λ must be tuned for each resolutionimagesyieldvisuallybetterstylizationresults. learned texture model. Choosing λ too small may fail to While instance normalization works much better than learnadiversegenerator,andsettingittoohighmaycreate batchnormalizationforstylization,fortexturesynthesisthe artifacts,asshowninfig.7. two normalization methods perform equally well. This is consistent with our intuition that IN helps in normalizing 6.Summary the information coming from content image x , which is 0 highlyvariable,whereasitisnotimportanttonormalizethe Thispaperadvancesfeed-forwardtexturesynthesisand texture information, as each model learns only one texture stylization networks in two significant ways. It introduces style. instancenormalization, anarchitecturalchangethatmakes trainingstylizationnetworkseasierandallowsthetraining [16] J.PortillaandE.P.Simoncelli. Aparametrictexturemodel processtoachievemuchlowerlosslevels.Italsointroduces based on joint statistics of complex wavelet coefficients. anewlearningformulationfortraininggeneratornetworks IJCV,2000. 2,3 to sample uniformly from the Julesz ensemble, thus ex- [17] A.Radford,L.Metz,andS.Chintala. Unsupervisedrepre- plicitlyencouragingdiversityinthegeneratedoutputs. We sentationlearningwithdeepconvolutionalgenerativeadver- sarialnetworks. CoRR,abs/1511.06434,2015. 3,7 show that both improvements lead to noticeable improve- [18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, mentsofthegeneratedstylizedimagesandtextures, while S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, keepingthegenerationruntimesintact. A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual References RecognitionChallenge. InternationalJournalofComputer Vision(IJCV),115(3):211–252,2015. 5 [1] L.J.Ba,R.Kiros,andG.E.Hinton. Layernormalization. [19] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. CoRR,abs/1607.06450,2016. 5 Texture networks: Feed-forward synthesis of textures and [2] E.L.Denton, S.Chintala, A.Szlam, andR.Fergus. Deep stylized images. In Proceedings of the 33nd International generativeimagemodelsusingalaplacianpyramidofadver- Conference on Machine Learning, ICML 2016, New York sarialnetworks. InNIPS,pages1486–1494,2015. 3 City, NY, USA, June 19-24, 2016, pages 1349–1357, 2016. [3] G.K.Dziugaite, D.M.Roy, andZ.Ghahramani. Training 1,2,3,4,5,6,7,8 generativeneuralnetworksviamaximummeandiscrepancy [20] S. C. Zhu, X. W. Liu, and Y. N. Wu. Exploring texture optimization. InUAI,pages258–267.AUAIPress,2015. 3 ensembles by efficient markov chain monte carlotoward a [4] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis atrichromacyotheoryoftexture. PAMI,2000. 2,3 using convolutional neural networks. In Advances in Neu- [21] S.C.Zhu, Y.Wu, andD.Mumford. Filters, randomfields ralInformationProcessingSystems, NIPS,pages262–270, andmaximumentropy(FRAME):Towardsaunifiedtheory 2015. 1,2,3 fortexturemodeling. IJCV,27(2),1998. 3 [5] L.A.Gatys,A.S.Ecker,andM.Bethge.Aneuralalgorithm ofartisticstyle. CoRR,abs/1508.06576,2015. 1,2,3 [6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D.Warde-Farley, S.Ozair, A.C.Courville, andY.Bengio. Generativeadversarialnets. InAdvancesinNeuralInforma- tionProcessingSystems(NIPS),pages2672–2680,2014. 3, 4 [7] A.Gretton,K.M.Borgwardt,M.Rasch,B.Scho¨lkopf,and A.J.Smola. Akernelmethodforthetwo-sample-problem. InAdvancesinneuralinformationprocessingsystems,NIPS, pages513–520,2006. 3 [8] J.Johnson, A.Alahi, andL.Fei-Fei. Perceptuallossesfor real-time style transfer and super-resolution. In Computer Vision - ECCV 2016 - 14th European Conference, Amster- dam, The Netherlands, October 11-14, 2016, Proceedings, PartII,pages694–711,2016. 1,2,3,4,6 [9] B.Julesz. Textons, theelementsoftextureperception, and theirinteractions. Nature,290(5802):91–97,1981. 2 [10] T. Kim and Y. Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439,2016. 4 [11] D. P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR,abs/1312.6114,2013. 4 [12] L.F.KozachenkoandN.N.Leonenko. Sampleestimateof the entropy of a random vector. Probl. Inf. Transm., 23(1- 2):95–101,1987. 2,4 [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS,pages1106–1114,2012. 5 [14] Y. Li, K. Swersky, and R. S. Zemel. Generative moment matching networks. In Proc. International Conference on MachineLearning,ICML,pages1718–1727,2015. 3 [15] J.PortillaandE.P.Simoncelli. Aparametrictexturemodel based on joint statistics of complex wavelet coefficients. IJCV,40(1):49–70,2000. 1