LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation PDF

21 Pages·2017·4.59 MB·English

Preview LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation

Published as a conference paper at ICLR 2017

LR-GAN: LAYERED RECURSIVE GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE GENERATION

Jianwei Yang* (Virginia Tech, Blacksburg, VA), Anitha Kannan (Facebook AI Research, Menlo Park, CA), Dhruv Batra* and Devi Parikh* (Georgia Institute of Technology, Atlanta, GA)
[email protected], [email protected], {dbatra, parikh}@gatech.edu
*Work was done while visiting Facebook AI Research.

ABSTRACT

We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitches the foregrounds onto the background in a contextually relevant manner to produce a complete natural image. For each foreground, the model learns to generate its appearance, shape and pose. The whole model is unsupervised, and is trained in an end-to-end manner with gradient descent methods. The experiments demonstrate that LR-GAN can generate more natural images with objects that are more human-recognizable than DCGAN.

1 INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have shown significant promise as generative models for natural images. A flurry of recent work has proposed improvements over the original GAN work for image generation (Radford et al., 2015; Denton et al., 2015; Salimans et al., 2016; Chen et al., 2016; Zhu et al., 2016; Zhao et al., 2016), multi-stage image generation including part-based models (Im et al., 2016; Kwak & Zhang, 2016), image generation conditioned on input text or attributes (Mansimov et al., 2015; Reed et al., 2016b;a), image generation based on 3D structure (Wang & Gupta, 2016), and even video generation (Vondrick et al., 2016).

While the holistic 'gist' of images generated by these approaches is beginning to look natural, there is clearly a long way to go. For instance, the foreground objects in these images tend to be deformed, blended into the background, and not look realistic or recognizable.

One fundamental limitation of these methods is that they attempt to generate images without taking into account that images are 2D projections of a 3D visual world, which has a lot of structure in it. This manifests as structure in the 2D images that capture this world. One example of this structure is that images tend to have a background, and foreground objects are placed in this background in contextually relevant ways.

We develop a GAN model that explicitly encodes this structure. Our proposed model generates images in a recursive fashion: it first generates a background, and then, conditioned on the background, generates a foreground along with a shape (mask) and a pose (affine transformation) that together define how the background and foreground should be composed to obtain a complete image. Conditioned on this composite image, a second foreground and an associated shape and pose are generated, and so on. As a byproduct in the course of recursive image generation, our approach generates object-shape foreground-background masks in a completely unsupervised way, without access to any object masks for training. Note that decomposing a scene into foreground-background layers is a classical ill-posed problem in computer vision. By explicitly factorizing appearance and transformation, LR-GAN encodes natural priors about images: the same foreground can be 'pasted' onto different backgrounds, under different affine transformations. According to our experiments, the absence of these priors results in degenerate foreground-background decompositions, and thus also degenerate final composite images.

Figure 1: Generation results of our model on CUB-200 (Welinder et al., 2010). It generates images in two timesteps.
At the first timestep, it generates background images, while foreground images, masks and transformations are generated at the second timestep. They are then composed to obtain the final images. From top left to bottom right (row major), the blocks are real images, generated background images, foreground images, foreground masks, carved foreground images, carved and transformed foreground images, final composite images, and their nearest-neighbor real images in the training set. Note that the model is trained in a completely unsupervised manner.

We mainly evaluate our approach on four datasets: MNIST-ONE (one digit) and MNIST-TWO (two digits) synthesized from MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky & Hinton, 2009) and CUB-200 (Welinder et al., 2010). We show qualitatively (via samples) and quantitatively (via evaluation metrics and human studies on Amazon Mechanical Turk) that LR-GAN generates images that globally look natural and contain clear background and object structures that are realistic and recognizable by humans as semantic entities. An experimental snapshot on CUB-200 is shown in Fig. 1. We also find that LR-GAN generates foreground objects that are contextually relevant to the backgrounds (e.g., horses on grass, airplanes in skies, ships in water, cars on streets, etc.). For quantitative comparison, besides existing metrics in the literature, we propose two new quantitative metrics to evaluate the quality of generated images. The proposed metrics are derived from sufficient conditions for the closeness between the generated image distribution and the real image distribution, and thus supplement existing metrics.

2 RELATED WORK

Early work in parametric texture synthesis was based on a set of hand-crafted features (Portilla & Simoncelli, 2000). Recent improvements in image generation using deep neural networks mainly fall into one of two stochastic models: variational autoencoders (VAEs) (Kingma et al., 2016) and generative adversarial networks (GANs) (Goodfellow et al., 2014). VAEs pair a top-down probabilistic generative network with a bottom-up recognition network for amortized probabilistic inference. The two networks are jointly trained to maximize a variational lower bound on the data likelihood. GANs consist of a generator and a discriminator in a minmax game, with the generator aiming to fool the discriminator with its samples and the latter aiming to not get fooled.

Sequential models have been pivotal for improved image generation using variational autoencoders: DRAW (Gregor et al., 2015) uses attention-based recurrence conditioned on the canvas drawn so far. In Eslami et al. (2016), a recurrent generative model that draws one object at a time onto the canvas was used as the decoder in a VAE. These methods are yet to show scalability to natural images. Early compelling results using GANs used sequential coarse-to-fine multiscale generation and class-conditioning (Denton et al., 2015). Since then, improved training schemes (Salimans et al., 2016) and better convolutional structure (Radford et al., 2015) have improved the generation results using GANs. PixelRNN (van den Oord et al., 2016) was also recently proposed to sequentially generate one pixel at a time, along the two spatial dimensions.

In this paper, we combine the merits of sequential generation with the flexibility of GANs. Our model for sequential generation imbibes a recursive structure that more naturally mimics image composition by inferring three components: appearance, shape, and pose. One closely related work combining recursive structure with GANs is that of Im et al. (2016), but it does not explicitly model object composition and follows a similar paradigm to Gregor et al. (2015). Another closely related work is that of Kwak & Zhang (2016).
It combines recursive structure and alpha blending. However, our work differs in three main ways: (1) We explicitly use a generator for modeling the foreground poses. This provides a significant advantage for natural images, as shown by our ablation studies; (2) Our shape generator is separate from the appearance generator. This factored representation allows more flexibility in the generated scenes; (3) Our recursive framework generates subsequent objects conditioned on the current and previous hidden vectors, and the previously generated object. This allows for explicit contextual modeling among generated elements in the scene. See Fig. 17 for contextually relevant foregrounds generated for the same background, or Fig. 6 for meaningful placement of two MNIST digits relative to each other.

Models that provide supervision to image generation using conditioning variables have also been proposed: Style/Structure GANs (Wang & Gupta, 2016) learn separate generative models for style and structure that are then composed to obtain final images. In Reed et al. (2016a), GAN-based image generation is conditioned on text and the region in the image where the text manifests, specified during training via keypoints or bounding boxes. While not the focus of our work, the model proposed in this paper can be easily extended to take into account these forms of supervision.

3 PRELIMINARIES

3.1 GENERATIVE ADVERSARIAL NETWORKS

Generative adversarial networks (GANs) consist of a generator G and a discriminator D that are simultaneously trained with competing goals: the generator G is trained to generate samples that can 'fool' the discriminator D, while the discriminator is trained to classify its inputs as either real (coming from the training dataset) or fake (coming from the samples of G). This competition leads to a minmax formulation with a value function:

$$\min_{\theta_G}\max_{\theta_D}\Big(\mathbb{E}_{x\sim p_{data}(x)}\big[\log D(x;\theta_D)\big] + \mathbb{E}_{z\sim p_z(z)}\big[\log\big(1 - D(G(z;\theta_G);\theta_D)\big)\big]\Big), \qquad (1)$$

where $z$ is a random vector drawn from a standard multivariate Gaussian or a uniform distribution $p_z(z)$, $G(z;\theta_G)$ maps $z$ to the data space, and $D(x)$ is the probability that $x$ is real as estimated by D. The advantage of the GAN formulation is that it lacks an explicit loss function and instead uses the discriminator to optimize the generative model. The discriminator, in turn, only cares whether the sample it receives is on the data manifold, and not whether it exactly matches a particular training example (as opposed to losses such as MSE). Hence, the discriminator provides a gradient signal only when the generated samples do not lie on the data manifold, so that the generator can readjust its parameters accordingly. This form of training enables learning the data manifold of the training set, rather than just optimizing to reconstruct the dataset, as in autoencoders and their variants.

While the GAN framework is largely agnostic to the choice of G and D, it is clear that generative models with the 'right' inductive biases will be more effective in learning from the gradient information (Denton et al., 2015; Im et al., 2016; Gregor et al., 2015; Reed et al., 2016a; Yan et al., 2015). With this motivation, we propose a generator that models image generation via a recurrent process: in each timestep of the recurrence, an object with its own appearance and shape is generated and warped according to a generated pose to compose an image in layers.

3.2 LAYERED STRUCTURE OF IMAGE

An image taken of our 3D world typically contains a layered structure. One way of representing an image layer is by its appearance and shape. As an example, an image $x$ with two layers, foreground $f$ and background $b$, may be factorized as

$$x = f \odot m + b \odot (1 - m), \qquad (2)$$

where $m$ is the mask depicting the shapes of the image layers, and $\odot$ is the elementwise multiplication operator.
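For concreteness, Eq. (2) amounts to per-pixel alpha compositing. A minimal Python sketch (array names are illustrative, not from the paper's code):

```python
import numpy as np

def compose_two_layers(foreground, background, mask):
    """Alpha-composite a foreground onto a background, as in Eq. (2):
    x = f * m + b * (1 - m), applied elementwise.

    foreground, background: float arrays of shape (H, W, C) in [0, 1]
    mask: float array of shape (H, W, 1) with values in (0, 1)
    """
    return foreground * mask + background * (1.0 - mask)

# toy usage: a circular mask pastes a red patch onto a gray canvas
h = w = 64
yy, xx = np.mgrid[0:h, 0:w]
mask = (((yy - 32) ** 2 + (xx - 32) ** 2) < 15 ** 2).astype(np.float32)[..., None]
fg = np.zeros((h, w, 3), np.float32); fg[..., 0] = 1.0   # red foreground
bg = np.full((h, w, 3), 0.5, np.float32)                 # gray background
composite = compose_two_layers(fg, bg, mask)
```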
Some existing methods assume access to the shape of the object, either during training (Isola & Liu, 2013) or at both train and test time (Reed et al., 2016a; Yan et al., 2015). Representing images in a layered structure is even more straightforward for video with moving objects (Darrell & Pentland, 1991; Wang & Adelson, 1994; Kannan et al., 2005). Vondrick et al. (2016) generate videos by separately generating a fixed background and moving foregrounds. A similar way of generating a single image can be found in Kwak & Zhang (2016).

Another way is to model the layered structure with object appearance and pose as

$$x = ST(f, a) + b, \qquad (3)$$

where $f$ and $b$ are foreground and background, respectively; $a$ is the affine transformation; and $ST$ is the spatial transformation operator. Several works fall into this group (Roux et al., 2011; Huang & Murphy, 2015; Eslami et al., 2016). In Huang & Murphy (2015), images are decomposed into layers of objects with specific poses in a variational autoencoder framework, while the number of objects (i.e., layers) is adaptively estimated in Eslami et al. (2016).

In contrast with these works, LR-GAN uses a layered composition, and the foreground layers simultaneously model all three dominant factors of variation: appearance $f$, shape $m$ and pose $a$. We elaborate on this in the following section.

4 LAYERED RECURSIVE GAN (LR-GAN)

The basic structure of LR-GAN is similar to GAN: it consists of a discriminator and a generator that are simultaneously trained using the minmax formulation of GAN, as described in §3.1. The key innovation of our work is the layered recursive generator, which is what we describe in this section.

The generator in LR-GAN is recursive in that the image is constructed recursively using a recurrent network. It is layered in that each recursive step composes an object layer that is 'pasted' on the image generated so far. The object layer at timestep $t$ is parameterized by three constituents: a 'canonical' appearance $f_t$, a shape (or mask) $m_t$, and a pose (or affine transformation) $a_t$ for warping the object before pasting it in the image composition.

Fig. 2 shows the architecture of LR-GAN with the generator unrolled for generating the background $x_0$ ($= x_b$) and foregrounds $x_1$ and $x_2$. At each timestep $t$, the generator composes the next image $x_t$ via the following recursive computation:

$$x_t = \underbrace{ST(m_t, a_t)}_{\text{affine transformed mask}} \odot \underbrace{ST(f_t, a_t)}_{\text{affine transformed appearance}} + \underbrace{(1 - ST(m_t, a_t)) \odot x_{t-1}}_{\text{pasting on image composed so far}}, \quad \forall t \in [1, T], \qquad (4)$$

where $ST(\cdot, a_t)$ is a spatial transformation operator that outputs the affine transformed version of its input, with $a_t$ indicating the parameters of the affine transformation.

Since our proposed model has an explicit transformation variable $a_t$ that is used to warp the object, it can learn a canonical object representation that can be re-used to generate scenes where the object occurs as a mere transformation of it, such as at different scales or rotations. By factorizing appearance, shape and pose, the object generator can focus on separately capturing the regularities in these three factors that constitute an object. We demonstrate in our experiments (Sections 5.5 and 5.6) that removing these factorizations from the model leads to it spending capacity on variability that may not solely be about the object.
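A minimal PyTorch sketch of one recursion of Eq. (4), assuming the warp is implemented with a sampling grid as in spatial transformer networks; the function and tensor names are ours, not from the released code, and the affine parameterization is simplified:

```python
import torch
import torch.nn.functional as F

def compose_step(x_prev, f_t, m_t, theta_t):
    """One recursion of Eq. (4).

    x_prev : (N, 3, H, W)  canvas composed so far (x_{t-1})
    f_t    : (N, 3, H, W)  generated foreground appearance
    m_t    : (N, 1, H, W)  generated foreground mask with values in (0, 1)
    theta_t: (N, 2, 3)     affine parameters a_t predicted from the hidden state
    """
    grid = F.affine_grid(theta_t, x_prev.size(), align_corners=False)
    f_warped = F.grid_sample(f_t, grid, align_corners=False)  # ST(f_t, a_t)
    m_warped = F.grid_sample(m_t, grid, align_corners=False)  # ST(m_t, a_t)
    # samples taken from outside the foreground extent default to zero
    return m_warped * f_warped + (1.0 - m_warped) * x_prev

# unrolling: start from the generated background and paste T foregrounds
# x = G_b(h_0); for t in 1..T: x = compose_step(x, f_t, m_t, theta_t)
```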
4.1 DETAILS OF GENERATOR ARCHITECTURE

Fig. 2 shows our LR-GAN architecture in detail – we use different shapes to indicate different kinds of layers (convolutional, fractional convolutional, (non)linear, etc.), as indicated by the legend. Our model consists of two main pieces – a background generator $G_b$ and a foreground generator $G_f$. $G_b$ and $G_f$ do not share parameters with each other. The $G_b$ computation happens only once, while $G_f$ is recurrent over time, i.e., all object generators share the same parameters. In the following, we introduce each module and the connections between them.

Figure 2: LR-GAN architecture unfolded to three timesteps. It mainly consists of one background generator, one foreground generator, temporal connections and one discriminator. The meaning of each component is explained in the legend.

Temporal Connections. LR-GAN has two kinds of temporal connections – informally speaking, one on 'top' and one on 'bottom'. The 'top' connections perform the act of sequentially 'pasting' object layers (Eqn. 4). The 'bottom' connections are constructed by an LSTM on the noise vectors $z_0, z_1, z_2$. Intuitively, this noise-vector LSTM provides information to the foreground generator about what else has been generated in the past. In addition, when generating multiple objects, we use a pooling layer $P_f^c$ and a fully-connected layer $E_f^c$ to extract information from the previously generated object response map. In this way, the model is able to 'see' previously generated objects.

Background Generator. The background generator $G_b$ is purposely kept simple. It takes the hidden state $h_0$ of the noise-vector LSTM as input and passes it through a number of fractional convolutional layers (also called 'deconvolution' layers in some papers) to generate an image at its end. The output of the background generator, $x_b$, is used as the canvas for the subsequently generated foregrounds.

Foreground Generator. The foreground generator $G_f$ is used to generate an object with appearance and shape. Correspondingly, $G_f$ consists of three sub-modules: $G_f^c$, a common 'trunk' whose outputs are shared by $G_f^i$ and $G_f^m$; $G_f^i$, which generates the foreground appearance $f_t$; and $G_f^m$, which generates the mask $m_t$ for the foreground. All three sub-modules consist of one or more fractional convolutional layers combined with batch normalization and nonlinear layers. The generated foreground appearance and mask have the same spatial size as the background. The top of $G_f^m$ is a sigmoid layer, in order to generate a one-channel mask whose values range in (0, 1).

Spatial Transformer. To spatially transform foreground objects, we need to estimate the transformation matrix. As in Jaderberg et al. (2015), we predict the affine transformation matrix with a linear layer $T_f$ that has six-dimensional outputs. Then, based on the predicted transformation matrix, we use a grid generator $G_g$ to generate the corresponding sampling coordinates in the input for each location at the output. The generated foreground appearance and mask share the same transformation matrix, and thus the same sampling grid. Given the grid, the sampler $S$ simultaneously samples $f_t$ and $m_t$ to obtain $\hat{f}_t$ and $\hat{m}_t$, respectively. Different from Jaderberg et al. (2015), our sampler normally performs downsampling, since the foreground typically has a smaller size than the background. Pixels in $\hat{f}_t$ and $\hat{m}_t$ that come from outside the extent of $f_t$ and $m_t$ are set to zero. Finally, $\hat{f}_t$ and $\hat{m}_t$ are sent to the compositor $C$, which combines the canvas $x_{t-1}$ and $\hat{f}_t$ through layered composition with blending weights given by $\hat{m}_t$ (Eqn. 4).

Pseudo-code for our approach and the detailed model configuration are provided in the Appendix.
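The foreground generator described above suggests roughly the following shape. This is a sketch under our own assumptions (channel counts, number of layers, and the tanh appearance output are illustrative placeholders; the paper only specifies the sigmoid mask head), not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ForegroundGenerator(nn.Module):
    """Sketch of G_f: a shared 'trunk' G_f^c feeding an appearance head
    G_f^i and a one-channel mask head G_f^m with a sigmoid output."""

    def __init__(self, in_dim=128, ngf=64):
        super().__init__()
        # G_f^c: fractionally-strided (transposed) convolutions with batch norm
        self.trunk = nn.Sequential(
            nn.ConvTranspose2d(in_dim, ngf * 4, 4, 1, 0), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.BatchNorm2d(ngf), nn.ReLU(True),
        )
        # G_f^i: foreground appearance f_t (tanh is a DCGAN-style assumption)
        self.appearance = nn.Sequential(nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh())
        # G_f^m: one-channel mask m_t with values in (0, 1)
        self.mask = nn.Sequential(nn.ConvTranspose2d(ngf, 1, 4, 2, 1), nn.Sigmoid())

    def forward(self, h):
        # h: (N, in_dim) hidden vector from the noise-vector LSTM
        shared = self.trunk(h.view(h.size(0), -1, 1, 1))
        return self.appearance(shared), self.mask(shared)
```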
4.2 NEW EVALUATION METRICS

Several metrics have been proposed to evaluate GANs, such as the Gaussian Parzen window (Goodfellow et al., 2014), the Generative Adversarial Metric (GAM) (Im et al., 2016) and the Inception Score (Salimans et al., 2016). The common goal is to measure the similarity between the distribution $P_g(x)$ of generated data $x = G(z; \theta_G)$ and the real data distribution $P(x)$. Most recently, the Inception Score has been used in several works (Salimans et al., 2016; Zhao et al., 2016). However, it is an asymmetric metric and can easily be fooled by generating only the centers of data modes.

In addition to these metrics, we present two new metrics based on the following intuition: a sufficient (but not necessary) condition for closeness of $P_g(x)$ and $P(x)$ is closeness of $P_g(x|y)$ and $P(x|y)$, i.e., the distributions of generated data and real data conditioned on all possible variables of interest $y$, e.g., the category label. One way to obtain this variable of interest $y$ is via human annotation. Specifically, given data sampled from $P_g(x)$ and $P(x)$, we ask people to label the category of the samples according to some rules. Note that such human annotation is often easier than comparing samples from the two distributions (e.g., because there is no 1:1 correspondence between samples to conduct forced-choice tests).

After the annotations, we need to verify whether the two distributions are similar in each category. Clearly, directly comparing the distributions $P_g(x|y)$ and $P(x|y)$ may be as difficult as comparing $P_g(x)$ and $P(x)$. Fortunately, we can use Bayes rule and alternatively compare $P_g(y|x)$ and $P(y|x)$, which is a much easier task. In this case, we can simply train a discriminative model on the samples from $P_g(x)$ and from $P(x)$, together with the human annotations about the categories of these samples. With a slight abuse of notation, we use $P_g(y|x)$ and $P(y|x)$ to denote the probability outputs of these two classifiers (trained on generated samples vs. trained on real samples). We can then use these two classifiers to compute the following two evaluation metrics:

Adversarial Accuracy: Computes the classification accuracies achieved by the two classifiers on a validation set, which can be the training set or another set of real images sampled from $P(x)$. If $P_g(x)$ is close to $P(x)$, we expect to see similar accuracies.

Adversarial Divergence: Computes the KL divergence between $P_g(y|x)$ and $P(y|x)$. The lower the adversarial divergence, the closer the two distributions are. The lower bound of this metric is exactly zero, which means $P_g(y|x) = P(y|x)$ for all samples in the validation set.

As discussed above, we need human effort to label the real and generated samples. Fortunately, we can simplify this further. Based on the labels given on the training data, we split the training data into categories and train one generator for each category. With all these generators, we generate samples of all categories. This strategy is used in our experiments on datasets where labels are given.
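A minimal sketch of the two metrics, assuming two already-trained classifiers with a scikit-learn-style `predict`/`predict_proba` interface (one trained on real samples, one on generated samples); the KL direction below is one reasonable reading of the definition, and all function names are ours:

```python
import numpy as np

def adversarial_accuracy(clf_real, clf_gen, x_val, y_val):
    """Classification accuracy of the real-trained and generated-trained
    classifiers on a held-out labeled validation set. Similar accuracies
    suggest the generated distribution is close to the real one."""
    acc_real = np.mean(clf_real.predict(x_val) == y_val)
    acc_gen = np.mean(clf_gen.predict(x_val) == y_val)
    return acc_real, acc_gen

def adversarial_divergence(clf_real, clf_gen, x_val, eps=1e-8):
    """Average KL( P(y|x) || P_g(y|x) ) over the validation set, where the
    two conditionals are the probability outputs of the two classifiers."""
    p = clf_real.predict_proba(x_val)   # (N, num_classes), real-trained classifier
    q = clf_gen.predict_proba(x_val)    # (N, num_classes), generated-trained classifier
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(np.mean(kl))
```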
5 EXPERIMENT

We conduct qualitative and quantitative evaluations on three datasets: 1) MNIST (LeCun et al., 1998); 2) CIFAR-10 (Krizhevsky & Hinton, 2009); 3) CUB-200 (Welinder et al., 2010). To add variability to the MNIST images, we randomly scale (by a factor of 0.8 to 1.2) and rotate (by $-\frac{\pi}{4}$ to $\frac{\pi}{4}$) the digits and then stitch them onto 48×48 uniform backgrounds with a random grayscale value in [0, 200]. Images are then rescaled back to 32×32. Each image thus has a different background grayscale value and a different transformed digit as foreground. We name this synthesized dataset MNIST-ONE (a single digit on a gray background). We also synthesize a dataset MNIST-TWO containing two digits on a grayscale background. We randomly select two digit images and perform similar transformations as described above, then put one on the left and the other on the right side of a 78×78 gray background. We resize the whole image to 64×64.

We develop LR-GAN based on open source code¹. We assume the number of objects is known. Therefore, for MNIST-ONE, MNIST-TWO, CIFAR-10, and CUB-200, our model has two, three, two, and two timesteps, respectively. Since the size of a foreground object should be smaller than that of the canvas, we set the minimal allowed scale² in the affine transformation to 1.2 for all datasets except MNIST-TWO, where it is set to 2 (objects are smaller in MNIST-TWO). In LR-GAN, the background generator and foreground generator have similar architectures. One difference is that the number of channels in the background generator is half of that in the foreground generator. We compare our results to those of DCGAN (Radford et al., 2015). Note that LR-GAN without the LSTM at the first timestep corresponds exactly to DCGAN. This allows us to run controlled experiments. In both the generator and the discriminator, all the (fractional) convolutional layers have 4×4 filters with stride 2. As a result, the number of layers in the generator and discriminator automatically adapts to the size of the training images. Please see the Appendix (Section 6.2) for details about the configurations. We use three metrics for quantitative evaluation: the Inception Score (Salimans et al., 2016) and the proposed Adversarial Accuracy and Adversarial Divergence. Note that we report two versions of the Inception Score: one is based on the pre-trained Inception net, and the other is based on a classifier pre-trained on the target datasets.

¹https://github.com/soumith/dcgan.torch
²Scale corresponds to the size of the target canvas with respect to the object – the larger the scale, the larger the canvas, and the smaller the relative size of the object in the canvas. A scale of 1 means the same size as the canvas.

5.1 QUALITATIVE RESULTS

In Fig. 3 and 4, we show the generated samples for CIFAR-10 and CUB-200, respectively. MNIST results are shown in the next subsection. As we can see from the images, the compositional nature of our model results in images that are free of blending artifacts between backgrounds and foregrounds. For CIFAR-10, we can see horses and cars with clear shapes. For CUB-200, the bird shapes tend to be even sharper.

Figure 3: Generated images on CIFAR-10 based on our model.

Figure 4: Generated images on CUB-200 based on our model.

5.2 MNIST-ONE AND MNIST-TWO

We now report the results on MNIST-ONE and MNIST-TWO. Fig. 5 shows the generation results of our model on MNIST-ONE. As we can see, our model generates the background and the foreground in separate timesteps, and can disentangle the foreground digits from the background nearly perfectly. Though the initial values of the mask are randomly distributed in the range (0, 1), after training the masks are nearly binary and accurately carve out the digits from the generated foreground. More results on MNIST-ONE (including human studies) can be found in the Appendix (Section 6.3).

Fig. 6 shows the generation results for MNIST-TWO. Similarly, the model is able to generate the background and the two foreground objects separately. The foreground generator tends to generate a single digit at each timestep. Meanwhile, it captures the context information from the previous timesteps: when the first digit is placed on the left side, the second one tends to be placed on the right side, and vice versa.
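For concreteness, the MNIST-ONE synthesis described at the beginning of this section might be sketched as follows; the interpolation, digit placement and compositing choices here are ours, not necessarily those used to build the dataset:

```python
import numpy as np
from PIL import Image

def synthesize_mnist_one(digit_28x28, rng=np.random):
    """Sketch of MNIST-ONE synthesis: randomly scale and rotate a 28x28
    digit, paste it on a 48x48 uniform gray background with a random
    intensity in [0, 200], then resize to 32x32."""
    scale = rng.uniform(0.8, 1.2)
    angle = rng.uniform(-45.0, 45.0)            # -pi/4 .. pi/4, in degrees
    digit = Image.fromarray(digit_28x28.astype(np.uint8))
    size = max(1, int(round(28 * scale)))
    digit = digit.resize((size, size), Image.BILINEAR).rotate(angle, resample=Image.BILINEAR)

    bg_value = int(rng.randint(0, 201))         # uniform background gray level
    canvas = Image.new("L", (48, 48), color=bg_value)
    offset = ((48 - size) // 2, (48 - size) // 2)
    # paste the digit using its own intensities as a soft mask (center placement assumed)
    canvas.paste(digit, offset, mask=digit)
    return np.array(canvas.resize((32, 32), Image.BILINEAR))
```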
Figure 5: Generation results of our model on MNIST-ONE. From left to right, the image blocks are real images, generated background images, generated foreground images, generated masks and final composite images, respectively.

Figure 6: Generation results of our model on MNIST-TWO. From top left to bottom right (row major), the image blocks are real images, generated background images, foreground images and masks at the second timestep, composite images at the second timestep, generated foreground images and masks at the third timestep, and the final composite images, respectively.

5.3 CUB-200

We study the effectiveness of our model trained on the CUB-200 bird dataset. In Fig. 1, we have shown a random set of generated images, along with the intermediate generation results of the model. While being completely unsupervised, the model is, for a large fraction of the samples, able to successfully disentangle the foreground and the background. This is evident from the generated bird-like masks.

We conduct a comparative study based on Amazon Mechanical Turk (AMT) between DCGAN and LR-GAN to quantify the relative visual quality of the generated images. We first generated 1000 samples from both models. Then, we performed a perfect matching between the two image sets using the Hungarian algorithm on the L2 distance in pixel space. This resulted in 1000 image pairs; some exemplar pairs are shown in Fig. 7. For each image pair, 9 judges are asked to choose the one that is more realistic. Based on majority voting, we find that our generated images are selected 68.4% of the time, compared with 31.6% for DCGAN. This demonstrates that our model generates more realistic images than DCGAN. We can attribute this difference to our model's ability to generate the foreground separately from the background, enabling stronger edge cues.

Figure 7: Matched pairs of generated images based on DCGAN and LR-GAN, respectively. The odd columns are generated by DCGAN, and the even columns are generated by LR-GAN. These are paired according to the perfect matching based on the Hungarian algorithm.

5.4 CIFAR-10

We now qualitatively and quantitatively evaluate our model on CIFAR-10, which contains multiple object categories and also various backgrounds.

Comparison of image generation quality: We conduct AMT studies to compare the fidelity of image generation. Towards this goal, we generate 1000 images each from DCGAN and LR-GAN. We ask 5 judges to label each image as belonging to one of the 10 categories, as 'non-recognizable', or as 'recognizable but not belonging to the listed categories'. We then assign each image a quality level between [0, 5] that captures the number of judges that agree with the majority choice. Fig. 8 shows the images generated by both approaches, ordered by increasing quality level. We merge images at quality level 0 (all judges said non-recognizable) and 1 together, and similarly images at levels 4 and 5. Visually, the samples generated by our model have clearer boundaries and object structures. We also computed the fraction of non-recognizable images: our model had a 10% absolute drop in non-recognizability rate (67.3% for ours vs. 77.7% for DCGAN). For reference, 11.4% of real CIFAR images were categorized as non-recognizable. Fig. 9 shows more generated (intermediate) results of our model.

Figure 8: Qualitative comparison on CIFAR-10. The top three rows are images generated by DCGAN; the bottom three rows are by LR-GAN. From left to right, the blocks display generated images with increasing quality level as determined by human studies.

Quantitative evaluation of generators: We evaluate the generators based on three metrics: 1) Inception Score; 2) Adversarial Accuracy; 3) Adversarial Divergence. To obtain a classifier model for evaluation, we remove the top layer of the discriminator used in our model, and then append two fully connected layers on top of it. We train this classifier using the training samples of CIFAR-10 and their annotations.
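A sketch of such an evaluation classifier, assuming the discriminator's convolutional trunk (with its top real/fake layer removed) is available as an `nn.Module`; the feature dimension and hidden width are illustrative placeholders, not values from the paper:

```python
import torch.nn as nn

def make_eval_classifier(discriminator_trunk, feat_dim, num_classes=10, hidden=256):
    """Reuse the discriminator without its top layer and append two fully
    connected layers to obtain a 10-way CIFAR-10 classifier for evaluation."""
    return nn.Sequential(
        discriminator_trunk,      # convolutional layers of D, top layer removed
        nn.Flatten(),             # flatten the response map to a feature vector
        nn.Linear(feat_dim, hidden),
        nn.ReLU(inplace=True),
        nn.Linear(hidden, num_classes),
    )
```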
Following Salimans et al. (2016), we generated 50,000 images based on DCGAN and LR-GAN, respectively. We compute two types of Inception Scores. The standard Inception Score is based on the Inception net as in Salimans et al. (2016), and the contextual Inception Score is based on our trained classifier model. To distinguish them, we denote the standard one as 'Inception Score†' and the contextual one as 'Inception Score††'. To obtain the Adversarial Accuracy and Adversarial Divergence scores, we train one generator on each of the 10 categories for DCGAN and LR-GAN, respectively. Then, we use these generators to generate samples of the different categories. Given these generated samples, we train the classifiers for DCGAN and LR-GAN separately. Along with the classifier trained on the real samples, we compute the Adversarial Accuracy

Table 1: Quantitative comparison between DCGAN and LR-GAN on CIFAR-10.

    Training Data            Real Images     DCGAN          Ours
    Inception Score†         11.18 ± 0.18    6.64 ± 0.14    7.17 ± 0.07
    Inception Score††        7.23 ± 0.09     5.69 ± 0.07    6.11 ± 0.06
    Adversarial Accuracy     83.33 ± 0.08    37.81 ± 0.02   44.22 ± 0.08
    Adversarial Divergence   0               7.58 ± 0.04    5.57 ± 0.06

    † Evaluated using the pre-trained Inception net as in Salimans et al. (2016).
    †† Evaluated using the supervisedly trained classifier based on the discriminator in LR-GAN.

Figure 9: Generation results of our model on CIFAR-10. From left to right, the blocks are: generated background images, foreground images, foreground masks, foreground images carved out by masks, carved foregrounds after spatial transformation, final composite images, and nearest-neighbor training images to the generated images.

Figure 10: Category-specific generation results of our model on CIFAR-10 categories of horse, frog, and cat (top to bottom). The blocks from left to right are: generated background images, foreground images, foreground masks, foreground images carved out by masks, carved foregrounds after spatial transformation, and final composite images.
