DeepFace: Face Generation using Deep Learning

Hardie Cate, Fahim Dalvi, Zeshan Hussain
[email protected] [email protected] [email protected]

Abstract

We use CNNs to build a system that both classifies images of faces based on a variety of different facial attributes and generates new faces given a set of desired facial characteristics. After introducing the problem and providing context in the first section, we discuss recent work related to image generation in Section 2. In Section 3, we describe the methods used to fine-tune our CNN and generate new images using a novel approach inspired by a Gaussian mixture model. In Section 4, we discuss our working dataset and describe our preprocessing steps and handling of facial attributes. Finally, in Sections 5, 6, and 7, we explain our experiments and results, and we conclude in the following section. Our classification system has 82% test accuracy. Furthermore, our generation pipeline successfully creates well-formed faces.

1. Introduction

Convolutional neural networks (CNNs) are powerful tools for image classification and object detection, but they can also be used to generate images. There are myriad image generation tasks and applications, ranging from generating potential mass lesions in radiology scans to creating landscapes or artistic scenes. In our project, we address the problem of generating faces based on certain desired facial characteristics. Potential facial characteristics fall within the general categories of raw attributes (e.g., big nose, brown hair), ethnicity (e.g., white, black, Indian), and accessories (e.g., sunglasses, hat). The problem can be stated as follows: given a set of facial attributes as input, produce an image of a well-formed face as output that contains these characteristics.

In our face generation system, we fine-tune a CNN pre-trained on faces to create a classification system for facial characteristics. We employ a novel technique that models distributions of feature activations within the CNN as a customized Gaussian mixture model. We then perform feature inversion using the relevant features identified by this model. The primary advantage of our implementation is that, unlike other generative approaches, it does not require any deep learning architecture apart from a CNN. Our face generation system has many potential uses, including identifying suspects in law enforcement settings, as well as more generic generative applications.

2. Related Work

Work surrounding generative models for deep learning has mostly been in developing graphical models, autoencoder frameworks, and, more recently, generative recurrent neural networks (RNNs). Specific graphical models that have been used to learn generative models of data are Restricted Boltzmann Machines (RBMs), undirected graphical models with connected stochastic visible and stochastic hidden units, and their generalizations, such as Gaussian RBMs. Srivastava and Salakhutdinov use these basic RBMs to create a Deep Boltzmann Machine (DBM), a multimodal model that learns a probability density over the space of multimodal inputs and can be effectively used for information retrieval and classification tasks [14]. Similar work by Salakhutdinov and Hinton shows how the learning of a high-capacity DBM with multiple hidden layers and millions of parameters can be made more efficient with a layer-by-layer "pre-training" phase that allows for more reasonable weight initializations by incorporating a bottom-up pass [13]. In his thesis, Salakhutdinov also adds to this learning algorithm by incorporating a top-down feedback pass as well as a bottom-up pass, which allows DBMs to better propagate uncertainty about ambiguous inputs [12].

Other generative approaches involve autoencoders. The first ideas regarding the probabilistic interpretation of autoencoders were proposed by Ranzato et al.; a more formal interpretation was given by Vincent, who described denoising autoencoders (DAEs) [10] [15]. A DAE takes an input $x \in [0,1]^d$ and first maps it, with an encoder, to a hidden representation $y \in [0,1]^{d'}$ through some mapping $y = s(Wx + b)$, where $s$ is a non-linearity such as a sigmoid. The latent representation $y$ is then mapped back via a decoder into a reconstruction $z$ of the same shape as $x$, i.e., $z = s(W'y + b')$. The parameters $W$, $b$, $b'$, and $W'$ are learned such that the average reconstruction loss between $x$ and $z$ is minimized [9]. Bengio et al. show an alternate form of the DAE: given some observed input $X$ and corrupted input $\tilde{X}$, where $\tilde{X}$ has been corrupted based on a conditional distribution $C(\tilde{X} \mid X)$, we train the DAE to estimate the reverse conditional $P(X \mid \tilde{X})$ [2]. With this formulation, Vincent et al. construct a deeper network of stacked DAEs to learn useful representations of the inputs [16].
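To make the encoder/decoder mapping concrete, here is a minimal denoising-autoencoder sketch in PyTorch. This is our illustration rather than code from [15] or [16]; the masking corruption, layer sizes, and training details are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One-layer DAE: encoder y = s(Wx + b), decoder z = s(W'y + b')."""
    def __init__(self, d: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d, d_hidden)
        self.decoder = nn.Linear(d_hidden, d)

    def forward(self, x_corrupted: torch.Tensor) -> torch.Tensor:
        y = torch.sigmoid(self.encoder(x_corrupted))  # latent representation y
        return torch.sigmoid(self.decoder(y))         # reconstruction z

# One choice of corruption C(X~|X): zero out a random 30% of the inputs,
# then train the DAE to reconstruct the clean x from the corrupted input.
dae = DenoisingAutoencoder(d=28 * 28, d_hidden=256)
optimizer = torch.optim.Adam(dae.parameters(), lr=1e-3)
x = torch.rand(64, 28 * 28)                # stand-in batch of inputs in [0, 1]
mask = (torch.rand_like(x) > 0.3).float()
loss = nn.functional.binary_cross_entropy(dae(x * mask), x)
loss.backward()
optimizer.step()
```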
An alternate model has been posited by Gregor et al., who propose using a recurrent neural network architecture to generate digits. This architecture is a type of variational autoencoder, a recent model that bridges deep learning and variational inference, since it comprises an encoder RNN that compresses the real images during training and a decoder RNN that reconstitutes images after receiving codes [5].

Finally, another approach for estimating generative models is via generative adversarial nets [3] [4]. In this framework, two models are simultaneously trained: a generative model G that captures the distribution of the data and a discriminative model D that estimates the probability that a sample came from the training data rather than G. G is trained to maximize the probability that D makes a mistake.

3. Methods

Our workflow consists of two steps. First, we fine-tune a pre-trained model to classify facial and image characteristics for an input image. Second, we generate faces given some description. We describe our approach for each step in the forthcoming sections.

3.1. Notation

To facilitate the discussion of our methods, we introduce some notation. Let $D$ denote our set of training images, and let $F$ denote the set of 73 facial attributes. From this point forward, we use the word attribute to refer strictly to one of the 73 facial characteristics in our system.

3.2. Fine-tuning

For fine-tuning, we employ the VGG-Face net, a 16-layer CNN that was trained on 2 million celebrity faces and evaluated on faces from the Labeled Faces in the Wild and YouTube Faces datasets [11]. Using the VGG-Face net as our base architecture, we attach 42 heads to the end of its fc-7 layer, each of which consists of a fully connected fc-8 layer and a softmax layer. Each softmax head is a multilabel classifier for a group of attributes that are highly correlated. For example, one group of attributes includes hair color (black, gray, brown, and blond); in general, someone usually has only one hair color, which makes grouping these particular attributes together as one multilabel classification reasonable. Furthermore, grouping highly correlated attributes for our classification task, in lieu of having 73 binary classifiers, provides the network with implicit information on the relationship between grouped features.
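As an illustration of this architecture, the sketch below attaches several independent softmax heads to a shared fc-7 representation in PyTorch. The backbone loader and the group sizes are placeholders; the paper does not show the authors' actual implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttributeNet(nn.Module):
    """Shared backbone up to fc-7, plus one fc-8 + softmax head per attribute group."""
    def __init__(self, backbone: nn.Module, fc7_dim: int, group_sizes: list):
        super().__init__()
        self.backbone = backbone  # e.g., VGG-Face truncated after fc-7
        # One head per group of highly correlated attributes (42 in the paper).
        self.heads = nn.ModuleList([nn.Linear(fc7_dim, k) for k in group_sizes])

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)  # fc-7 activations
        # Each head produces a distribution over its own attribute group.
        return [torch.softmax(head(feats), dim=1) for head in self.heads]

# Hypothetical usage: 42 groups whose sizes sum to the 73 attributes.
# backbone = load_vgg_face_up_to_fc7()  # placeholder loader, not a real API
# model = MultiHeadAttributeNet(backbone, fc7_dim=4096,
#                               group_sizes=[4, 2, 3])  # 42 entries in practice
```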
During training, we freeze certain layers in the CNN and learn the weights and biases on the unfrozen layers. To determine which layers to freeze, we run several experiments with different sets of frozen layers and choose the set with the optimal performance. We describe the experiments used to assess our architecture and the results of the fine-tuning in Section 5.

Figure 1: Modified VGG-Face architecture.

3.3. Generation

3.3.1 Baseline Approach

We begin the generation phase of our project with a simple baseline approach that uses class visualization. This approach begins with the mean image $M$, which is the pixel-wise and channel-wise mean of all images in our training set. We add small Gaussian noise to $M$ and use this image $X$ as the input to our baseline algorithm. Suppose we wish to produce a face with a set of attributes $C \subseteq F$. In each iteration of our algorithm, we do the following (a code sketch of this loop appears after the list):

1. Perform a forward pass with $X$ as input;
2. Set the gradients at each softmax layer to either 1 or 0, depending on our target attributes $C$;
3. Perform a backward pass through the network with these gradients;
4. Update $X$ using stochastic gradient descent (SGD) and regularization.
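A minimal sketch of this loop, assuming a model that returns a single concatenated score vector and a 0/1 mask over attributes; the hyperparameter values are arbitrary. Maximizing the selected scores by gradient ascent on $X$ is equivalent to injecting 1/0 gradients at the softmax layers as described above.

```python
import torch

def class_visualization(model, target_mask, mean_image, n_steps=100, lr=1.0,
                        weight_decay=1e-4):
    """Baseline generation loop (steps 1-4 above); an illustrative sketch.

    model       -- fine-tuned CNN returning one concatenated score vector
    target_mask -- 0/1 vector selecting the desired attributes C
    mean_image  -- pixel- and channel-wise training-set mean, shaped like a
                   network input, e.g., (1, 3, 224, 224)
    """
    X = (mean_image + 0.01 * torch.randn_like(mean_image)).requires_grad_(True)
    # weight_decay acts as the regularizer on X.
    optimizer = torch.optim.SGD([X], lr=lr, weight_decay=weight_decay)
    for _ in range(n_steps):
        optimizer.zero_grad()
        scores = model(X)                     # step 1: forward pass
        loss = -(scores * target_mask).sum()  # steps 2-3: boost chosen scores
        loss.backward()
        optimizer.step()                      # step 4: SGD update of X
    return X.detach()
```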
This baseline serves primarily to demonstrate that our CNN correctly represents important facial structures such as eyes, noses, and mouths, even after fine-tuning on specific facial attributes. The approach is not comprehensive enough, however, because boosting certain features from the last layer in the network does not limit the number of facial structures that appear in the image. A more complete discussion of the results of the baseline approach can be found in Sections 5 and 6.

Our next and final approach uses a customized variant of a Gaussian mixture model (GMM). We opt for this technique over the more traditional generative approaches discussed in Section 2 for two main reasons. First, to our knowledge, this method is novel and has not been used in the context of image generation using CNNs. Second, this approach does not require complex structures in addition to our existing CNN, so the model is simpler and has far fewer parameters to train.

3.3.2 Custom Gaussian Mixture Model

This approach behaves similarly to the baseline in that we begin with an input image $X$ and boost this image toward a set of attributes $C \subseteq F$. The primary differences are that instead of class visualization we use feature inversion with respect to a layer in the CNN, and that we begin with random noise instead of the mean image.

Intuitively, feature inversion tries to minimize the difference between the activations of an input image and some target activations in a certain layer of the CNN. These target activations are the activations that we believe an input image with the desired attributes $C$ would have when passed through the CNN. More formally, let $\phi_l(I)$ be the activations at layer $l$ when an image $I$ is passed through the CNN. Let $T_l$ denote the target activations for layer $l$, and suppose we have a way of determining $T_l$ from $C$. We wish to find an image $I^*$ by solving the optimization problem

$$I^* = \arg\min_I \; \|T_l - \phi_l(I)\|_2^2 + R(I) \qquad (1)$$

where $\|\cdot\|_2^2$ is the squared Euclidean norm and $R$ is a regularizer such as blurring and/or jitter. Given $T_l$ and $X$, feature inversion does the following:

1. Perform a forward pass with $X$ as input, up to layer $l$;
2. Set the gradients at $l$ by taking the gradient of the objective function above with respect to the activations in layer $l$;
3. Perform a backward pass from $l$ with these gradients;
4. Update $X$ using SGD and regularization.

The challenge of this approach is determining an appropriate set of target activations $T_l$ given $C$. To our knowledge, there is no established way to automatically detect these target activations given solely the desired facial attributes.

To address this problem, we introduce a custom Gaussian mixture model (cGMM), in which we model the distribution of $T_l$ for a given attribute $f$ as a multivariate Gaussian distribution. For each $f \in F$, we estimate the mean and covariance matrix of its Gaussian by sampling images in our training set that are positive examples for this attribute. For computational simplicity, we assume the covariance matrix for each distribution is diagonal. This assumption implies that we effectively treat each multivariate Gaussian as a stacking of many independent Gaussians, each of which models the distribution of activations for a single neuron in $l$.

Specifically, for the attributes $f_1, f_2, \ldots, f_{73} \in F$, we select 73 random sets of images $S_1, S_2, \ldots, S_{73} \subseteq D$ of equal size such that for each $i$, every image in $S_i$ has attribute $f_i$. Then, for each $i$, we compute a mean vector $\mu_i$ and covariance matrix $\Sigma_i$ based on the activations at layer $l$ of the images in $S_i$. These mean vectors and covariance matrices define 73 independent multivariate Gaussian distributions.

Thus, according to our model, if we wish to produce possible activations at $l$ of an image that has attribute $f_i$ and no other $f \in F$, we can simply sample from $\mathcal{N}(\mu_i, \Sigma_i)$. More generally, we can estimate the target activations for a set of attributes by taking a weighted sum of samples from the Gaussians, which we assume to be independent. If we wish to estimate the target activations at $l$ for a subset $C \subseteq F$ of attributes, we produce a sample given by

$$s = \frac{1}{|C|} \sum_{i=1}^{73} \mathbb{1}_i \, w_i z_i \qquad (2)$$

where $\mathbb{1}_i = 1$ if $f_i \in C$ and 0 otherwise, $w_i$ is the weight assigned to the $i$th Gaussian, and $z_i \sim \mathcal{N}(\mu_i, \Sigma_i)$ is a random variable. This $s$ can be used in place of $T_l$ for feature inversion.

Now we must determine the weight vector $w = (w_1, \ldots, w_{73})$. We learn this vector by finding weights that minimize the difference between the target activations produced by the sampling method above and the actual activations of images in our training set. That is, we solve the optimization problem

$$w^* = \arg\min_w \; \frac{1}{|D|} \sum_{i=1}^{|D|} \|\phi_l(I_i) - \Phi_i\|_2^2 + \lambda \|w\|_2^2 \qquad (3)$$

where

$$\Phi_i = \frac{1}{|C_i|} \sum_{j=1}^{73} \mathbb{1}_{i,j} \, w_j \mu_j \qquad (4)$$

and $\mathbb{1}_{i,j}$ indicates whether $f_j \in C_i$, the attribute set for image $I_i$. We learn the weights by taking derivatives of this objective function with respect to $w$ and iteratively updating $w$ using SGD.
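The two pieces compose as follows: sample a target activation via equation (2), then run feature inversion, equation (1), toward it. The sketch below is our reconstruction under stated assumptions (a `model_up_to_l` callable that returns layer-$l$ activations, diagonal covariances stored as variances, and blurring as the regularizer $R$); it is not the authors' released code.

```python
import torch

def sample_target_activations(C, w, mu, sigma_diag):
    """Eq. (2): weighted sum of samples from the per-attribute Gaussians.

    C          -- indices of the desired attributes (subset of 0..72)
    w          -- (73,) learned weight vector
    mu         -- (73, n_l) per-attribute mean activations at layer l
    sigma_diag -- (73, n_l) per-attribute diagonal covariances (variances)
    """
    z = mu + torch.sqrt(sigma_diag) * torch.randn_like(mu)  # z_i ~ N(mu_i, Sigma_i)
    return sum(w[i] * z[i] for i in C) / len(C)

def feature_inversion(model_up_to_l, T_l, n_steps=200, lr=1.0, blur_every=10):
    """Eq. (1): gradient descent on the input image toward target activations T_l."""
    X = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from random noise
    optimizer = torch.optim.SGD([X], lr=lr)
    for step in range(n_steps):
        optimizer.zero_grad()
        loss = (T_l - model_up_to_l(X)).pow(2).sum()  # squared Euclidean norm
        loss.backward()
        optimizer.step()
        if step % blur_every == 0:  # regularizer R(I): periodic blurring
            with torch.no_grad():
                X.copy_(torch.nn.functional.avg_pool2d(X, 3, stride=1, padding=1))
    return X.detach()
```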
4. Datasets and Features

We are currently using the PubFig dataset [8], a collection of faces from public photos of 200 individuals. The dataset includes 58K images scraped from various sources on the Internet, along with a bounding box around the face in each image. However, to avoid copyright issues, the authors provide links to the images instead of hosting the images themselves. We have roughly 20,000 images in our training set and 8,000 images in our test set. The missing images include those that no longer exist at the given links, those that have changed, and a few that had incorrect bounding boxes for the faces.

Since these images are scraped from public photographs, there is considerable variation among the faces. Most faces are frontal, while some are at slight angles. There is also a wide spectrum of lighting conditions in the dataset.

Figure 2: Variations in lighting conditions and camera angles.

For each image in the dataset, we also have a list of 73 attributes, including gender, eye color, face shape, and ethnicity. Many of these attributes are interdependent. For example, only one of the attributes black hair, blond hair, and brown hair is positively labeled for each individual.

5. Experiments & Analysis

5.1. Fine-tuning Experiments

To understand the data better, we first run some experiments to find out what the distribution of our data looks like. For each image, we are given information about the presence or absence of 73 attributes. In Figure 7 we see that the distribution is quite skewed towards certain attributes. For example, around 76% of the people in the dataset are white, while only 3% are black. The number of people of other ethnicities is even lower, so a large portion of the images do not have any ethnicity labeling. Another prime example of such a disparity is eyewear: around 90% of the people in the dataset wear no form of eyewear in their photos. As another example of the incompleteness of our dataset, only 60% of the people have age information associated with their faces. These numbers give us a good idea of the limits of the system we are building and why certain attributes might generate better faces than others.

Figure 7: Distribution of all attributes in our training dataset.

As mentioned in Section 3.2, we use the VGG-Face net as our base for fine-tuning. Before training the network, we perform some analysis to support the validity of VGG-Face net as our base architecture. Specifically, we perform a forward pass for a set of images from our dataset and obtain activations at the fc-7 layer. We then transform the space of these activations into two dimensions using t-SNE. As evidenced by Figure 3(a), we see clustering even when using the weights directly from VGG-Face net. This gives us confidence that the network is a good candidate for transfer learning. However, clustering does not occur for all of the attributes, as seen in Figure 3(b). This is also expected, since the VGG-Face net learned attributes that were important in distinguishing celebrity faces from each other, and certain attributes like soft lighting may not provide the discriminatory power to do so. After looking at the overall results across the attributes, we decided that we need to backpropagate our gradients into at least some of the layers in the network.

Figure 3: 2D representation of the fc-7 feature space: (a) youth, (b) soft lighting.
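This validation step is straightforward to reproduce with scikit-learn's t-SNE; a sketch, assuming the fc-7 activations and per-attribute labels have already been saved to the (hypothetical) files below:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

fc7 = np.load("fc7_activations.npy")       # (n_images, 4096); placeholder file
labels = np.load("attribute_labels.npy")   # 0/1 labels for one attribute

# Project the fc-7 feature space to 2D and color points by the attribute.
coords = TSNE(n_components=2, perplexity=30).fit_transform(fc7)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap="coolwarm")
plt.title("fc-7 activations projected by t-SNE")
plt.show()
```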
Next, we tune hyperparameters to get the best results from transfer learning. We try a wide range of learning rates, diagnosing the learning process by looking at the loss curve. We also try various configurations of freezing layers in the network. As we can see from Figure 4, backpropagating into more layers in the network helps the learning process, but the final loss is very similar across all trials. We also try different learning rates for the layers based on their distance to the heads. Results were similar to the previous trials, with the network converging to more or less the same loss.

Figure 4: Loss curves after unfreezing various layers in the network.

5.2. Image Generation Experiments

We first run several experiments using our baseline class visualization approach. For our primary experiment, we attempt to boost several features that vary in frequency of appearance in our dataset, including black (ethnicity) and black hair, for which there are fewer images, and middle-aged, smiling, and male, for which the frequency of images is much higher. A resultant image of this experiment is shown in Figure 5. We see that the final image has multiple facial structures, i.e., multiple sets of noses, eyes, eyebrows, etc. However, the image also contains some semblance of black hair and a facial structure that is more masculine. Our baseline demonstrates that the trained CNN has learned what these general structures in the images look like.

Because class visualization seems to boost any part of the image whose activations are close to the desired features, we shift our approach to feature inversion. For some image with desired attributes, there exist corresponding activations at an arbitrary layer that we can use to generate the image. To sanity-check this idea, we perform an experiment in which we use the ground-truth activations at a convolutional layer for a Barack Obama image and attempt to reconstruct Obama from these activations. We obtain the activations by performing a forward pass on the image and stopping at the desired convolutional layer. The resultant Barack Obama image is shown in Figure 5. This result shows that the feature inversion method is a reasonable approach as long as we have access to the "target" activations of an image with the selected attributes. As described in our methods, we use a cGMM to estimate these target activations.

Figure 5: Target image, vanilla feature inversion, and cGMM feature inversion.

To justify modeling each attribute by an independent multivariate Gaussian, we take the images in set $S_i$ for label $f_i$ and, for each image, plot a single element of the activation at the fc-6 layer. We then plot the distribution of that particular element of the activation across all the images for that label $i$. Looking at two distributions representing two elements of the fc-6 activation in Figure 6, we see that the distributions are reasonably close to normal. In general, after plotting distributions for each element across a sample of images for an arbitrary label, the distribution of element $j$ of the fc-6 activation for label $i$ appears normal (not all graphs are shown). Thus, because we have some confidence that the true distribution of the target activations is Gaussian, we can proceed with our cGMM formulation.

Figure 6: Distributions of single elements of the fc-6 activations for (a) the black attribute and (b) the male attribute.

Now, we perform experiments to learn the weights in our cGMM. The forward and backward passes are written from scratch, so we use numerical gradient checking to validate the analytic gradients.
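A standard centered-difference check of this kind, written against the weight vector $w$ of equation (3); the tolerance and helper names are our own:

```python
import numpy as np

def numerical_gradient_check(loss_fn, w, analytic_grad, eps=1e-5):
    """Compare centered finite differences with the analytic gradient at w.

    loss_fn       -- maps a weight vector to the scalar loss of Eq. (3)
    analytic_grad -- gradient computed by our hand-written backward pass
    """
    numeric_grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        numeric_grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    # Relative error; values below ~1e-6 suggest the analytic gradient is correct.
    return np.abs(numeric_grad - analytic_grad) / np.maximum(
        1e-12, np.abs(numeric_grad) + np.abs(analytic_grad))
```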
To choose the learning rate and regularization parameters, we start by arbitrarily setting them to 1e-6 and 1e-5, respectively. Then, we incrementally reduce the learning rate until the loss starts to decrease instead of diverging to infinity. Finally, we use the trained model to generate several faces with differing attributes.

6. Results

Our final training parameters for the fine-tuning are a learning rate of 5e-6 and 50 epochs. As a form of regularization, we utilize dropout after every layer. The average classification accuracies are shown in Table 1. The classification accuracy for every facial attribute is above 0.5, and most are around 0.8 or 0.9.

Table 1: Average classification accuracy when freezing parameters beneath layers fc-6 and conv-5.

            training set    test set
  fc-6      0.847           0.795
  conv-5    0.867           0.815
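To illustrate the freezing experiments behind Table 1, here is a PyTorch-style sketch that disables gradients for every parameter beneath a chosen boundary layer. It assumes parameters are registered in architectural order, and the layer names are placeholders:

```python
import torch

def freeze_below(model: torch.nn.Module, boundary: str) -> None:
    """Freeze all parameters registered before `boundary`; train the rest."""
    frozen = True
    for name, param in model.named_parameters():
        if name.startswith(boundary):
            frozen = False  # boundary reached; this layer and later ones train
        param.requires_grad = not frozen

# Hypothetical usage mirroring the fc-6 row of Table 1:
# freeze_below(model, "fc6")
# optimizer = torch.optim.SGD(
#     (p for p in model.parameters() if p.requires_grad),
#     lr=5e-6)  # the final fine-tuning learning rate reported above
```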
For our cGMM training, the final learning rate and regularization parameter are 1e-11 and 1e-5, respectively. We generate faces for sets of attributes that are defined by several images in the test set (i.e., we plug these target attributes into our system and compare the image we generate with the ground-truth image). One example of a generated image is shown in Figure 8. To measure how "reasonable" our faces are, we use a similarity metric implemented by the PicTriev software, which uses common facial features to measure similarity [1]. This quantitative evaluation of our faces is shown in Table 2.

Table 2: Similarity percentages for different faces.

  person               similarity (%)
  Barack Obama         35
  Clive Owen           27
  Cristiano Ronaldo    45
  Jared Leto           52
  Julia Roberts        42
  Mickey Rourke        22
  Miley Cyrus          42
  Nicole Richie        30
  Ryan Seacrest        30

Figure 8: Cristiano Ronaldo generated image.

7. Discussion

Although our classification accuracies are generally quite high, some facial attributes, such as lighting (e.g., soft lighting, harsh lighting) and photo type (e.g., color photo, posed photo), have low accuracies (roughly 55-60%). We do not attempt to improve these attributes, since they are not as important as other, more face-related characteristics, such as nose shape and forehead visibility, that have high classification accuracies.

The most obvious result of our generation experiments is that each resultant image is very close to the mean image of our training set. Despite this, the fact that we successfully generate a coherent face from random noise indicates that our feature inversion process could work if the activations for each attribute were more discriminative.

We hypothesize several reasons why our activations are not discriminative enough. One possibility is that the distributions for different attributes are very similar and clustered around the mean image in this abstract space. Another possibility is that the distributions lie in different manifolds in this space, but that they are not normal, merely roughly centered around the mean image. In that case, averaging these activations and approximating them as Gaussians blurs the distinctions between the distributions.

This brings us to the assumptions about the feature activations on which the cGMM approach relies. Our first assumption is that the distribution of activations for each attribute in a given layer is indeed Gaussian. Additionally, we assume that these Gaussians are independent. This is perhaps the least justified assumption the model makes, because many attributes are clearly correlated, such as middle-aged and white hair. Despite this, we hope that our model approximates the true distribution sufficiently well for our purposes. We also assume that the variables within each multivariate Gaussian are themselves normally distributed and independent. We believe this assumption is valid because we examined several of these distributions for individual variables, as described in Section 5.

One way to improve our cGMM technique would be to estimate the target activations of a desired image by taking a weighted sum of samples from the Gaussians instead of a weighted sum of the means of the Gaussians. We opted for the latter approach because it was far more computationally efficient, but the former could allow for more variation from the mean image.

8. Conclusion & Future Work

In this work, we have trained an architecture that classifies facial attributes of images with high accuracy. We have also shown the effectiveness of the custom Gaussian mixture model in generating well-formed faces from random noise. Although the faces themselves were not discriminative enough for the desired features, we posit that we could train a model that generates discriminative faces if our dataset contained more diversity across the attributes.

Alternatively, we might implement a variational autoencoder structure as in [5] and compare the results with those of the cGMM. We expect that since the variational autoencoder has a higher capacity, it might give more discriminative faces but would also take longer to train. More efficient generative models open up exciting possibilities in other applications, including medicine and law enforcement.

References

[1] AppliedDevice. Pictriev: searching faces on the web, 2010. [Online; accessed 2016-03-14].
[2] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pages 899-907, 2013.
[3] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 2014.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[5] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[6] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[7] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001-. [Online; accessed 2016-03-14].
[8] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In IEEE International Conference on Computer Vision (ICCV), Oct 2009.
[9] L. Lab. Denoising autoencoders (dA). 2016.
[10] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Proceedings of NIPS, 2007.
[11] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.
[12] R. Salakhutdinov. Learning deep generative models. PhD thesis, University of Toronto, 2009.
[13] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 448-455, 2009.
[14] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2222-2230. Curran Associates, Inc., 2012.
[15] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.
[16] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371-3408, 2010.
