Face Synthesis from Facial Identity Features

Forrester Cole (Google Research, [email protected]), David Belanger* (University of Massachusetts Amherst, [email protected]), Dilip Krishnan (Google Research, [email protected]), Aaron Sarna (Google Research, [email protected]), Inbar Mosseri (Google Research, [email protected]), William T. Freeman (CSAIL, MIT and Google Research, [email protected])

* Work done during an internship at Google.

Abstract

We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph. This is achieved by learning to generate facial landmarks and textures from features extracted from a facial-recognition network. Unlike previous generative approaches, our encoding feature vector is largely invariant to lighting, pose, and facial expression. Exploiting this invariance, we train our decoder network using only frontal, neutral-expression photographs. Since these photographs are well aligned, we can decompose them into a sparse set of landmark points and aligned texture maps. The decoder then predicts landmarks and textures independently and combines them using a differentiable image warping operation. The resulting images can be used for a number of applications, such as analyzing facial attributes, exposure and white balance adjustment, or creating a 3-D avatar.

Figure 1. Input photos (top) are encoded using a face recognition network [1] into 1024-D feature vectors, then decoded into an image of the face using our decoder network (middle). The invariance of the encoder network to pose, lighting, and expression allows the decoder to produce a normalized face image. The resulting images can be easily fit to a 3-D model [2] (bottom). Our method can even produce plausible reconstructions from black-and-white photographs and paintings of faces.

1. Introduction

Recent work in computer vision has produced deep neural networks that are extremely effective at face recognition, achieving high accuracy over millions of identities [3]. These networks embed an input photograph in a high-dimensional feature space, where photos of the same person map to nearby points. The feature vectors produced by a network such as FaceNet [1] are remarkably consistent across changes in pose, lighting, and expression. As is common with neural networks, however, the features are opaque to human interpretation. There is no obvious way to reverse the embedding and produce an image of a face from a given feature vector.

We present a method for mapping from FaceNet features back to images of faces. This problem is hugely underconstrained: the output image has 150× more dimensions than the input feature vector. Our key idea is to exploit the invariance of the FaceNet features to pose, lighting, and expression by posing the problem as mapping from a feature vector to an evenly-lit, front-facing, neutral-expression face, which we call a normalized face image. Intuitively, the mapping from identity to normalized face image is nearly one-to-one, so we can train a decoder network to learn it (Fig. 1). We train the decoder network on a carefully-constructed set of normalized face images.

Because the facial identity features are so reliable, the trained decoder network is robust to a broad range of nuisance factors such as occlusion, lighting, and pose variation, and can even successfully operate on monochrome photographs or paintings. The robustness of the network sets it apart from related methods that directly frontalize the face by warping the input image to a frontal pose [4, 5], which cannot compensate for occlusion or lighting variation.
The consistency of the resulting normalized face allows a range of applications. For example, the neutral expression of the synthesized face and the facial landmark locations make it easy to fit a 3-D morphable model [2] to create a virtual reality avatar (Sec. B). Automatic color correction and white balancing can also be achieved by transforming the color of the input photograph to match the color of the predicted face (Sec. 7.3). Finally, our method can be used as an exploratory tool for visualizing what features are reliably captured by a facial recognition system.

Similar to the active shape model of Lanitis et al. [6], our decoder network explicitly decouples the face's geometry from its texture. In our case, the decoder produces both a registered texture image and the positions of facial landmarks as intermediate activations. Based on the landmarks, the texture is warped to obtain the final image.

In developing our model, we tackle a few technical challenges. First, end-to-end learning requires that the warping operation be differentiable. We employ an efficient, easy-to-implement method based on spline interpolation. This allows us to compute FaceNet similarity between the input and output images as a training objective, which helps to retain perceptually-relevant details.

Second, it is difficult to obtain large amounts of front-facing, neutral-expression training data. In response, we employ a data-augmentation scheme that exploits the texture-shape decomposition, where we randomly morph the training images by interpolating with nearest neighbors. The augmented training set allows for fitting a high-quality neural network model using only 1K unique input images.

The techniques introduced in this work, such as decomposition into geometry and texture, data augmentation, and differentiable warping, are applicable to domains other than face normalization.

2. Background and Related Work

2.1. Inverting Deep Neural Network Features

The interest in understanding deep networks' predictions has led to several approaches for creating an image from a particular feature vector. One approach directly optimizes the image pixels by gradient descent [7, 8, 9, 10], producing images similar to "DeepDream" [11]. Because the pixel space is so large relative to the feature space, optimization requires heavy regularization terms, such as total variation [9] or Gaussian blur [10]. The resulting images are intriguing, but not realistic.

A second, more closely related approach trains a feed-forward network to reverse a given embedding [12, 13]. Dosovitskiy and Brox [13] pose this problem as constructing the most likely image given a feature vector. Our method, in contrast, uses the more restrictive criterion that the image must be a normalized face.

Finally, perhaps the most relevant prior work to our contribution is Zhmoginov and Sandler [14], which presents both iterative and feed-forward methods for inverting FaceNet embeddings to recover an image of a face. Their techniques require no new training data, and are instead bootstrapped from a pretrained FaceNet model and a collection of smoothness priors. On the other hand, our method generally produces better fine-grained details.

2.2. Active Appearance Model for Faces

The active appearance model of Cootes et al. [15] and its extension to 3-D by Blanz and Vetter [2] provide parametric models for manipulating and generating face images. The model is fit to limited data by decoupling faces into two components: texture T and the facial landmark geometry L. Fig. 3 shows this decoupling for a single image. In Fig. 3 (middle), a set L of landmark points (e.g., tip of nose) is detected. In Fig. 3 (right), the image is warped such that its landmarks are located at the training dataset's mean landmark locations L̄. The warping operation aligns the textures so that, for example, the left pupil in every training image lies at the same pixel coordinates.

Figure 3. From left to right: input training image, detected facial landmark points, and the result of warping the input image to the mean face geometry.
In [15, 2], the authors fit separate principal components analysis (PCA) models to the textures and geometry. These can be fit reliably using substantially less data than a PCA model on the raw images. An individual face is described by the coefficients of the principal components of the landmarks and textures. To reconstruct the face, the coefficients are un-projected to obtain reconstructed landmarks and texture, then the texture is warped to the landmarks.

There are various techniques for warping. For example, Blanz and Vetter [2] define triangulations for both L and L̄ and apply an affine transformation for each triangle in L to map it to the corresponding triangle in L̄. In Sec. 4 we employ an alternative based on spline interpolation.

Figure 2. Though the model was only trained on natural images, it is robust enough to be applied to degraded photographs and illustrations. Column 1: input image. Column 2: generated 2-D image. Columns 3 and 4: images of the 3-D reconstruction taken from 2 different angles.

2.3. FaceNet

FaceNet [1] is a deep neural network that maps from face images taken in the wild to 128-dimensional features. Its architecture is similar to the popular Inception model [16]. FaceNet is trained with a triplet loss: the embeddings of two pictures of person A should be more similar than the embedding of a picture of person A and a picture of person B. This loss encourages the model to capture aspects of a face pertaining to its identity, such as geometry, and to ignore factors of variation specific to the instant the image was captured, such as lighting, expression, and pose. Furthermore, FaceNet is trained on a very large dataset, and encodes information about a wide variety of human faces. Recently, models trained on publicly available data have approached or exceeded FaceNet's performance [17], and we believe that our results can be duplicated with these networks.

We employ FaceNet both as a source of pretrained input features and as a source of a training loss: the input image and the generated image should have similar FaceNet embeddings. Using pretrained deep networks to provide loss functions is useful because these losses may be more correlated with perceptual, rather than pixel-level, differences in images [18, 19]. When training with adversarial losses, a deep-network-based loss is learned simultaneously with the parameters of the model of interest [20, 21, 22, 23, 24].
2.4. Face Frontalization

Prior work in face frontalization adopts a non-parametric approach to registering and normalizing face images taken in the wild [25, 26, 27, 28, 5, 4]. Landmarks are detected on the input image and are aligned to points on a reference 3-D or 2-D model. Then, the image is pasted onto the reference model using non-linear warping. Finally, the rendered front-facing image can be fed to downstream models that were trained on front-facing images. The approach is largely parameter-free and does not require labeled training data, but struggles to normalize variation due to lighting, expression, or occlusion (Fig. 16).

2.5. Face Generation using Neural Networks

Unsupervised learning of generative image models is an active research area, and many papers evaluate on the celebA dataset [29] of face images [29, 30, 31, 32]. In these, the generated images are smaller and generally lower-quality than ours. Comparing these approaches with our system is also challenging because they draw independent samples, whereas we generate images conditional on an input image. Therefore, we cannot achieve high quality simply by memorizing certain prototypes.

3. Autoencoder Model

We assume a training set of front-facing, neutral-expression images. As preprocessing, we decompose each image into a texture T and a set of landmarks L using off-the-shelf landmark detection tools and the warping technique of Sec. 4.

At test time, we consider images taken in the wild, with substantially more variation in lighting, pose, etc. For these, applying our training preprocessing pipeline to obtain L and T is inappropriate. Instead, we use a deep architecture to map directly from the image to estimates of L and T. The overall architecture of our network is shown in Fig. 4.

Figure 4. Model Architecture: We first encode an image as a small feature vector using FaceNet [1] (with fixed weights) plus an additional multi-layer perceptron (MLP) layer, i.e. a fully connected layer with ReLU non-linearities. Then, we separately generate a texture map, using a deep convolutional network (CNN), and a vector of the landmarks' locations, using an MLP. These are combined using differentiable warping to yield the final rendered image.

3.1. Encoder

Our encoder takes an input image I and returns an f-dimensional feature vector F. We need to choose the encoder carefully so that it is robust to shifts in the domains of images. In response, we employ a pretrained FaceNet model [1] and do not update its parameters. Our assumption is that FaceNet normalizes away variation in face images that is not indicative of the identity of the subject. Therefore, the embeddings of the controlled training images get mapped to the same space as those taken in the wild. This allows us to train only on the controlled images.

Instead of the final FaceNet output, we use the lowest layer that is not spatially varying: the 1024-D "avgpool" layer of the "NN2" architecture. We train a fully-connected layer from 1024 to f dimensions on top of this layer.

3.2. Decoder

We could have mapped from F to an output image directly using a deep network. This would need to simultaneously model variation in the geometry and textures of faces. As with Lanitis et al. [6], we have found it substantially more effective to separately generate landmarks L and textures T and render the final result using warping.

We generate L using a shallow multi-layer perceptron with ReLU non-linearities applied to F. To generate the texture images, we use a deep CNN. We first use a fully-connected layer to map from F to 56×56×256 localized features. Then, we use a set of stacked transposed convolutions [33], separated by ReLUs, with a kernel width of 5 and stride of 2 to upsample to 224×224×32 localized features. The number of channels after the ith transposed convolution is 256/2^i. Finally, we apply a 1×1 convolution to yield 224×224×3 RGB values.

Because we are generating registered texture images, it is not unreasonable to use a fully-connected network, rather than a deep CNN. This maps from F to 224×224×3 pixel values directly using a linear transformation. Despite the spatial tiling of the CNN, these models have roughly the same number of parameters. We contrast the outputs of these approaches in Sec. 7.4.

The decoder combines the textures and landmarks using the differentiable warping technique described in Sec. 4. With this, the entire mapping from input image to generated image can be trained end-to-end.
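To make the decoder shapes concrete, the following is a minimal sketch of the landmark and texture branches described above, written with tf.keras layers. The 56×56×256 seed, kernel width of 5, stride of 2, and final 1×1 convolution come from the text; the landmark MLP width, the number of transposed-convolution stages, and their channel counts are assumptions, since the stated 256/2^i channel schedule and the 224×224×32 output do not uniquely determine the layer count.

```python
import tensorflow as tf

def build_decoder(f_dim=1024, n_landmarks=65):
    """Sketch of the decoder of Sec. 3.2 (layer counts are assumptions)."""
    features = tf.keras.Input(shape=(f_dim,), name="identity_features")

    # Landmark branch: a shallow MLP with ReLU non-linearities.
    x = tf.keras.layers.Dense(256, activation="relu")(features)
    landmarks = tf.keras.layers.Dense(n_landmarks * 2, name="landmarks")(x)

    # Texture branch: FC layer to 56x56x256, then stride-2 transposed
    # convolutions (kernel width 5) up to 224x224, then a 1x1 convolution
    # producing RGB values.
    t = tf.keras.layers.Dense(56 * 56 * 256, activation="relu")(features)
    t = tf.keras.layers.Reshape((56, 56, 256))(t)
    for channels in (128, 64):  # assumed 256 / 2^i for i = 1, 2
        t = tf.keras.layers.Conv2DTranspose(
            channels, kernel_size=5, strides=2, padding="same",
            activation="relu")(t)
    texture = tf.keras.layers.Conv2D(3, kernel_size=1, name="texture")(t)

    return tf.keras.Model(features, [landmarks, texture], name="decoder")
```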
3.3. Training Loss

Our loss function is a sum of the terms depicted in Fig. 5. First, we separately penalize the error of our predicted landmarks and textures, using mean squared error and mean absolute error, respectively. This is a more effective loss than penalizing the reconstruction error of the final rendered image. Suppose, for example, that the model predicts the eye color correctly, but the location of the eyes incorrectly. Penalizing reconstruction error of the output image may encourage the eye color to resemble the color of the cheeks. However, by penalizing the landmarks and textures separately, the model will incur no cost for the color prediction, and will only penalize the predicted eye location.

Next, we reward perceptual similarity between generated images and input images by penalizing the dissimilarity of the FaceNet embeddings of the input and output images. We use a FaceNet network with fixed parameters to compute 128-dimensional embeddings of the two images and penalize their negative cosine similarity. Training with the FaceNet loss adds considerable computational cost: without it, we do not need to perform differentiable warping during training. Furthermore, evaluating FaceNet on the generated image is expensive. See Sec. 7.4 for a discussion of the impact of the FaceNet loss on training.

Figure 5. Training Computation Graph: Each dashed line connects two terms that are compared in the loss function. Textures are compared using mean absolute error, landmarks using mean squared error, and FaceNet embeddings using negative cosine similarity.
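The combined objective can be written compactly. The sketch below mirrors the three terms above, using the loss weights reported in Sec. 7 (landmark 1, texture 100, FaceNet 10). The facenet_embed callable stands in for the fixed, pretrained FaceNet embedding network and is an assumption of this sketch, as are the argument names.

```python
import tensorflow as tf

def total_loss(pred_landmarks, true_landmarks,
               pred_texture, true_texture,
               rendered_image, input_image,
               facenet_embed,
               w_landmark=1.0, w_texture=100.0, w_facenet=10.0):
    """Sketch of the training loss of Sec. 3.3 (weights from Sec. 7)."""
    # Landmarks: mean squared error.
    landmark_loss = tf.reduce_mean(tf.square(pred_landmarks - true_landmarks))
    # Registered textures: mean absolute error.
    texture_loss = tf.reduce_mean(tf.abs(pred_texture - true_texture))
    # FaceNet term: negative cosine similarity between the embeddings of the
    # input image and the rendered (warped) output image.
    e_in = tf.math.l2_normalize(facenet_embed(input_image), axis=-1)
    e_out = tf.math.l2_normalize(facenet_embed(rendered_image), axis=-1)
    facenet_loss = -tf.reduce_mean(tf.reduce_sum(e_in * e_out, axis=-1))

    return (w_landmark * landmark_loss +
            w_texture * texture_loss +
            w_facenet * facenet_loss)
```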
4. Differentiable Image Warping

Let I_0 be a 2-D image. Let L = {(x_1, y_1), ..., (x_n, y_n)} be a set of 2-D landmark points and let D = {(dx_1, dy_1), ..., (dx_n, dy_n)} be a set of displacement vectors for each control point. In the morphable model, I_0 is the texture image T and D = L − L̄ is the displacement of the landmarks from the mean geometry.

We seek to warp I_0 into a new image I_1 that satisfies two properties: (a) the landmark points have been shifted by their displacements, i.e. I_1[x_i, y_i] = I_0[x_i + dx_i, y_i + dy_i], and (b) the warping is continuous and the resulting flow-field derivatives of any order are controllable. In addition, we require that I_1 is a differentiable function of I_0, D, and L. We describe our method in terms of 2-D images, but it generalizes naturally to higher dimensions.

Figure 6. Image warping. Left: starting landmark locations. Middle-left: desired final locations, including zero-displacement boundary conditions. Middle-right: dense flow field obtained by spline interpolation. Right: application of the flow to the image.

Fig. 6 describes our warping, which consists of two modules that can each be accelerated using vectorized computation (especially on GPUs) and are easy to implement using modern numerical libraries. First, we construct a dense flow field from the sparse displacements defined at the control points using spline interpolation. Then, we apply the flow field to I_0 in order to obtain I_1. For the second step, we use simple bilinear interpolation, which is differentiable. The next section describes the spline interpolation step.

4.1. Differentiable Spline Interpolation

The interpolation is done independently for horizontal and vertical displacements. For each dimension, we have a scalar g_p defined at each 2-D control point p in L and seek to produce a dense 2-D grid of scalar values. Besides the facial landmark points, we include extra points at the boundary of the image, where we enforce zero displacement.

We employ polyharmonic interpolation [34], where the interpolant has the functional form

s(x, y) = \sum_{i=1}^{n} w_i \phi_k(\|(x, y) - (x_i, y_i)\|) + v_1 x + v_2 y + v_3.   (1)

Here, \phi_k are a set of radial basis functions. Common choices are \phi_1(r) = r and \phi_2(r) = r^2 \log(r) (the popular thin-plate spline). For our experiments we choose k = 1, since the linear interpolant is more robust to overshooting than the thin-plate spline, and the linearization artifacts are difficult to detect in the final texture.

Polyharmonic interpolation chooses the parameters w_i, v_1, v_2, v_3 such that s interpolates the signal exactly at the control points, and such that it minimizes a certain definition of curvature [34]. Algorithm 1 shows the combined process of estimating the interpolation parameters on training data and evaluating the interpolant at a set of query points. The optimal parameters can be obtained in closed form via operations that are either linear algebra or coordinate-wise non-linearities, all of which are differentiable. Therefore, since (1) is a differentiable function of x, y, the entire interpolation process is differentiable.

Algorithm 1: Differentiable Spline Interpolation
Inputs: points P = {(x_1, y_1), ..., (x_n, y_n)}, function values G = {g_1, ..., g_n}, radial basis function \phi_k, query points Q = {(x_1, y_1), ..., (x_m, y_m)}
Outputs: evaluation of (1) at each point in Q using parameters fit on P, G.
  dists_ij = \|P_i - P_j\|
  A = \phi_k(dists)
  B = [1 ... 1; x_1 ... x_n; y_1 ... y_n]
  [w; v] = [A B^T; B 0] \ [G; 0]   % solve linear system
  Return \sum_{i=1}^{n} w_i \phi_k(\|(x, y) - (x_i, y_i)\|) + v_1 x + v_2 y + v_3 evaluated at each (x, y) in Q.
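For concreteness, below is a minimal TensorFlow sketch of Algorithm 1 for the \phi_1(r) = r basis used in our experiments. The function and variable names are ours, and batching, boundary-point construction, and the bilinear resampling step are omitted; it is meant only to illustrate that the fit-and-evaluate procedure reduces to differentiable linear algebra.

```python
import tensorflow as tf

def polyharmonic_interpolate(points, values, queries):
    """Sketch of Algorithm 1 with the phi_1(r) = r basis.

    points:  [n, 2] control points (landmarks plus fixed boundary points).
    values:  [n, 1] scalar values at the control points; the warp calls this
             once for horizontal and once for vertical displacements.
    queries: [m, 2] coordinates of the dense grid to evaluate.
    Returns an [m, 1] tensor; every operation is differentiable.
    """
    # A_ij = phi_1(||P_i - P_j||) = ||P_i - P_j||.
    A = tf.norm(points[:, None, :] - points[None, :, :], axis=-1)       # [n, n]
    # B here is the transpose of the paper's B: one row [1, x_i, y_i] per point.
    B = tf.concat([tf.ones_like(points[:, :1]), points], axis=1)        # [n, 3]

    # Solve [[A, B], [B^T, 0]] [w; v] = [G; 0] for the spline parameters.
    lhs_top = tf.concat([A, B], axis=1)                                 # [n, n+3]
    lhs_bot = tf.concat([tf.transpose(B),
                         tf.zeros([3, 3], points.dtype)], axis=1)       # [3, n+3]
    lhs = tf.concat([lhs_top, lhs_bot], axis=0)
    rhs = tf.concat([values, tf.zeros([3, 1], points.dtype)], axis=0)
    wv = tf.linalg.solve(lhs, rhs)
    w, v = wv[:-3], wv[-3:]

    # Evaluate s(q) = sum_i w_i ||q - P_i|| + v . [1, x, y] at the queries.
    phi = tf.norm(queries[:, None, :] - points[None, :, :], axis=-1)    # [m, n]
    q_aug = tf.concat([tf.ones_like(queries[:, :1]), queries], axis=1)  # [m, 3]
    return phi @ w + q_aug @ v
```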
5. Data Augmentation using Random Morphs

Training our model requires a large, varied database of evenly-lit, front-facing, neutral-expression photos. Collecting photographs of this type is difficult, and publicly-available databases are too small to train the decoder network (see Fig. 12). In response, we construct a small set of high-quality photos and then use a data augmentation approach based on morphing.

5.1. Producing random face morphs

Since the faces are front-facing and have similar expressions, we can generate plausible novel faces by morphing. Given a seed face A, we first pick a target face by selecting one of the k = 200 nearest neighbors of A at random. We measure the distance between faces A and B as

d(A, B) = \lambda \|L_A - L_B\| + \|T_A - T_B\|,   (2)

where L are matrices of landmarks and T are texture maps, and \lambda = 10.0 in our experiments. Given A and the random neighbor B, we linearly interpolate their landmarks and textures independently, where the interpolation weights are drawn uniformly from [0, 1].

Figure 7. Data augmentation using face morphing and gradient-domain compositing. The left column contains average images of individuals. The remaining columns contain random morphs with other individuals in the training set.

5.2. Gradient-domain Compositing

Morphing tends to preserve details inside the face, where the landmarks are accurate, but cannot capture hair and background detail. To make the augmented images more realistic, we paste the morphed face onto an original background using a gradient-domain editing technique [35].

Given the texture for a morphed face image T_f and a target background image T_b, we construct constraints on the gradients and colors of the output texture T_o as

\partial_x T_o = \partial_x T_f \circ M + \partial_x T_b \circ (1 - M)
\partial_y T_o = \partial_y T_f \circ M + \partial_y T_b \circ (1 - M)   (3)
T_o \circ M = T_f \circ M,

where \circ is the element-wise product and the blending mask M is defined by the convex hull of the global average landmarks, softened by a Gaussian blur. Equations 3 form an over-constrained linear system that we solve in the least-squares sense. The final result is formed by warping T_o to the morphed landmarks (Fig. 7).
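The following NumPy sketch illustrates the sampling procedure of Sec. 5.1 (Eq. 2 plus independent interpolation of landmarks and textures). The array shapes, the helper name random_morph, and the exclusion of the seed from its own neighbor list are assumptions; the gradient-domain compositing of Sec. 5.2 is not included.

```python
import numpy as np

def random_morph(seed_idx, landmarks, textures, lam=10.0, k=200, rng=None):
    """Sketch of the random-morph augmentation of Sec. 5.1.

    landmarks: [N, 65, 2] landmark matrices and textures: [N, 224, 224, 3]
    registered texture maps for the N base identities (shapes are assumed).
    Returns one morphed (landmarks, texture) pair.
    """
    if rng is None:
        rng = np.random.default_rng()
    L_a, T_a = landmarks[seed_idx], textures[seed_idx]

    # Eq. 2: d(A, B) = lam * ||L_A - L_B|| + ||T_A - T_B||.
    d_land = np.linalg.norm((landmarks - L_a).reshape(len(landmarks), -1), axis=1)
    d_tex = np.linalg.norm((textures - T_a).reshape(len(textures), -1), axis=1)
    d = lam * d_land + d_tex
    d[seed_idx] = np.inf                       # do not morph a face with itself
    neighbor = rng.choice(np.argsort(d)[:k])   # one of the k nearest neighbors

    # Interpolate landmarks and textures independently, weights ~ U[0, 1].
    a, b = rng.uniform(), rng.uniform()
    morphed_landmarks = (1.0 - a) * L_a + a * landmarks[neighbor]
    morphed_texture = (1.0 - b) * T_a + b * textures[neighbor]
    return morphed_landmarks, morphed_texture
```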
6. Training Data

6.1. Collecting photographs

There are a variety of large, publicly-available databases of photographs available online. We choose the VGG Face dataset [17] for its size and its emphasis on facial recognition. It contains 2.6M photographs, but very few of these fit our requirements of front-facing, neutral pose, and sufficient quality. We use the Google Cloud Vision API (cloud.google.com/vision) to remove monochrome and blurry images, faces with a high emotion score or eyeglasses, and tilt or pan angles beyond 5°. The remaining images are aligned to undo any roll transformation, scaled to maintain an interocular distance of 55 pixels, and cropped to 224×224. After filtering, we have approximately 12K images (<0.5% of the original set).

6.2. Averaging to reduce lighting variation

VGG Face provides multiple images for an individual. We keep ≈1K unique identities with 3 or more images. To further remove variation in lighting, we average all images for each individual by morphing. Given a set of images of an individual I_j, we extract facial landmarks L_j for each image using the method of Kazemi and Sullivan [36] and then average the landmarks to form L_µ. Each image I_j is warped to the average landmarks L_µ, then the pixel values are averaged to form an average image of the individual I_µ. As shown in Fig. 8, this operation tends to even out lighting variation, producing images that resemble photographs with soft, even lighting. These 1K images form the base training set.

Figure 8. Averaging images of the same individual to produce consistent lighting. Example input photographs (left three columns) have large variation in lighting and color. Averaging tends to produce an evenly lit, but still detailed, result (right column).

The backgrounds in the training images are widely variable, leading to noisy backgrounds in our results. Cleaner results could probably be obtained by manually removing the backgrounds from the training set.

7. Experiments

For our experiments we mainly focus on the Labeled Faces in the Wild [37] dataset, since its identities are mutually exclusive with VGG Face. We include a few examples from other sources, such as a painting, to show the range of the method.

Except where otherwise noted, the results were produced with the architecture of Section 3, with weights on the landmark loss = 1, the FaceNet loss = 10, and the texture loss = 100. Our data augmentation produces 1M images. The model was implemented in TensorFlow [38] and trained using the Adam optimizer [39].

7.1. Model Robustness

Fig. 16 shows the robustness of our model to nuisance factors such as occlusion, pose, and illumination. We use 4 identities from the LFW dataset [37], and 4 images for each identity (shown in the top row of each quadrant). The output from our model is shown in the second row. The shape and skin tone of the face are quite stable across different poses and illumination, but variations such as hair style and hair color are captured in the output image (see, for example, the 3rd and 4th columns of the top-left quadrant). Severe occlusions such as sunglasses and headwear do not significantly impact the output quality. The robustness of our model is perhaps best exhibited by the result on the painting in Fig. 1.

In contrast to our outputs, the face frontalization approach of Hassner et al. [4] cannot remove occlusions, handle extreme poses, neutralize some expressions, correct for variability in illumination, or recover hair.

Figure 9. Face normalization for people in the LFW dataset [37]. Top: input photographs. Middle: result of our method. Bottom: result of Hassner et al. [4]. Note our method's robustness to pose, lighting, and partial occlusion. Additional results are in the supplementary material.

7.2. 3-D Model Fitting

The landmarks and texture of the normalized face can be used to fit a 3-D morphable model (Fig. 10). Fitting a morphable model to an unconstrained image of a face requires solving a difficult inverse rendering problem [2], but fitting to a normalized face image is much more straightforward.

To fit the shape of the face, we first manually establish a correspondence between the 65 predicted landmarks l_i and the best matching 65 vertices v_i of the 3-D mesh used to train the model of Blanz and Vetter [2]. This correspondence is based on the semantics of the landmarks and does not change for different faces. We then optimize for the shape parameters that best match v_i to l_i using gradient descent. The landmarks provide 65 × 2 = 130 constraints for the 199 parameters of the morphable model, so the optimization is additionally regularized towards the average face. Please see the supplementary material for details.

Once the face mesh is aligned with the predicted landmarks, we project the synthesized image onto the mesh as vertex colors. The projection works well for areas that are close to front-facing, but is noisy and imprecise at grazing angles. To clean the result, we project the colors further onto the model's texture basis to produce clean, but less accurate, vertex colors. We then produce a final vertex color by blending the synthesized image color and the texture basis color based on the foreshortening angle. Again, please see the supplementary material for details.

As shown in Fig. 10, the fitting process produces a well-aligned 3-D face mesh that could be directly used as a VR avatar, or could serve as an initialization for further processing, for example in methods to track facial geometry in video [40, 41]. In our experiments, the fidelity of the reconstructed shape is limited by the range of the morphable model, and could likely be improved with a more diverse model such as the recent LSFM [42].

Figure 10. Mapping of our model's output onto a 3-D face. Small images: input and fit 3-D model. Large images: synthesized 2-D image. Photos by Wired.com, CC BY-NC 2.0 (images were cropped).
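As a rough illustration of the landmark-based shape fit (the full procedure is in the supplementary material), the sketch below optimizes the 199 shape parameters by gradient descent so that 65 corresponding model vertices match the predicted landmarks, with an L2 regularizer pulling the solution toward the average face. The linear shape model, the orthographic 2-D projection of the corresponding vertices, the regularization weight, and the optimizer settings are all assumptions made for illustration.

```python
import tensorflow as tf

def fit_shape_params(pred_landmarks, mean_verts, shape_basis,
                     reg_weight=1.0, steps=200, learning_rate=0.05):
    """Sketch of the landmark-based shape fit of Sec. 7.2.

    Assumed inputs: pred_landmarks [65, 2], mean_verts [65, 2] (the 65
    corresponding model vertices projected to 2-D), shape_basis [65, 2, 199]
    (a linear shape model: verts = mean + basis @ params).
    """
    params = tf.Variable(tf.zeros([199]))
    optimizer = tf.keras.optimizers.Adam(learning_rate)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            # 65 x 2 = 130 constraints for 199 parameters, so regularize
            # toward the average face (params = 0).
            verts = mean_verts + tf.tensordot(shape_basis, params,
                                              axes=[[2], [0]])
            data_term = tf.reduce_sum(tf.square(verts - pred_landmarks))
            loss = data_term + reg_weight * tf.reduce_sum(tf.square(params))
        grads = tape.gradient(loss, [params])
        optimizer.apply_gradients(zip(grads, [params]))
    return params
```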
7.3. Automatic Photo Adjustment

Since the normalized face image provides a "ground truth" image of the face, it can easily be applied to automatically adjust the exposure and white balance of a photograph (Fig. 11). We apply the following simple algorithm: given an aligned input photograph P and the corresponding normalized face image N, extract a box from the center of P and N (in our experiments, the central 100×100 pixels out of 224×224) and average the cropped regions to form mean face colors m_P and m_N. The adjusted image is computed using a per-channel, piecewise-linear color shift function r_c(p) over the pixels of P:

r_c(p) = \begin{cases} p_c \, m^c_N / m^c_P & \text{if } p_c \le m^c_P \\ 1 - (1 - p_c)\,(1 - m^c_N) / (1 - m^c_P) & \text{if } p_c > m^c_P \end{cases}   (4)

where c are the color channels. We chose YCrCb as the color representation in our experiments.

For comparison, we apply the general white balancing algorithm of Barron [43]. This approach does not focus on the face, and is limited in the adjustment it makes, whereas our algorithm balances the face regardless of the effect on the other regions of the image, producing more consistent results across different photos of the same person.

Figure 11. Automatic adjustment of exposure and white balance using the color of the normalized face for some images from the LFW dataset. In each set of images (2 sets of 3 rows), the first row contains the input images, the second row the outputs from our method, and the third row the outputs of Barron [43], a state-of-the-art white balancing method. The implicit encoding of skin tone in our model is crucial to the exposure and white balance recovery.
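A direct NumPy transcription of Eq. 4 is shown below. It assumes both images are already aligned, 224×224, and represented in YCrCb with values in [0, 1]; the color-space conversion to and from YCrCb, and any handling of saturated channels, are omitted.

```python
import numpy as np

def adjust_exposure_white_balance(photo, normalized_face, crop=100):
    """Sketch of the piecewise-linear color shift of Eq. 4 (Sec. 7.3)."""
    h, w, _ = photo.shape
    y0, x0 = (h - crop) // 2, (w - crop) // 2
    # Mean face colors m_P and m_N from the central crop of each image.
    m_p = photo[y0:y0 + crop, x0:x0 + crop].mean(axis=(0, 1))
    m_n = normalized_face[y0:y0 + crop, x0:x0 + crop].mean(axis=(0, 1))

    # Eq. 4, per channel: scale values below the face mean, and compress
    # values above it so that 1.0 maps to 1.0.
    low = photo * (m_n / m_p)
    high = 1.0 - (1.0 - photo) * (1.0 - m_n) / (1.0 - m_p)
    return np.where(photo <= m_p, low, high)
```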
7.4. Impact of Design Decisions

In Fig. 12 we contrast the output of our system with two variations: a model trained without data augmentation, and a model that uses data augmentation, but employs a fully-connected network for predicting textures. Training without data augmentation yields more artifacts due to overfitting. The fully-connected decoder generates images that are very generic: though it has separate parameters for every pixel, its capacity is limited because there is no mechanism for coordinating outputs at multiple scales.

Figure 12. Output from various configurations of our system. We either train on the 1K raw images or on 1M images obtained using the data augmentation technique of Sec. 5. We generate textures using either a fully-connected (FC) or convolutional (CNN) decoder.

Fig. 13 compares outputs of models trained with and without the FaceNet loss. The difference is subtle but visible, and has the perceptual effect of improving the likeness of the recovered image.

Figure 13. Results with and without the loss term penalizing differences in the FaceNet embedding. The FaceNet loss encourages subtle but important improvements in fidelity, especially around the eyes and eyebrows. The result is a lower error between the embeddings of the input and synthesized images.

The improvement from training with the FaceNet loss can also be measured by evaluating FaceNet on the test outputs. Fig. 14 shows the distributions of L_2 distances between the embeddings of the LFW images and their corresponding synthesized results, for models trained with and without the FaceNet loss. Schroff et al. [1] consider two FaceNet embeddings to encode the same person if their L_2 distance is less than 1.242. With the FaceNet loss, all of the synthesized images pass this test, but without it, about 2% of the images would be misidentified by FaceNet as a different person.

Figure 14. Histograms of FaceNet L_2 error between input and synthesized images on LFW. Blue: with FaceNet loss (Sec. 3.3). Green: without FaceNet loss. The 1.242 threshold was used by Schroff et al. [1] to cluster identities. Without the FaceNet loss, about 2% of the synthesized images would not be considered the same identity as the input image.

8. Conclusion and Future Work

We have introduced a neural network that maps from images of faces taken in the wild to front-facing, neutral-expression images that capture the likeness of the individual. The network is robust to variation in the inputs, such as lighting, pose, and expression, that causes problems for prior face frontalization methods. The method provides a variety of down-stream opportunities, including automatically white-balancing images and creating customized 3-D avatars.

Spline interpolation has been used extensively in computer graphics, but we are unaware of work where interpolation has been used as a differentiable module inside a network. We encourage further application of the technique.

Going forward, we hope to improve the overall quality of the generated images. Noise artifacts likely result from overfitting, especially in the background, and blurriness likely results from using a pixel-level mean-squared error. Ideally, we would train on a broader selection of images and avoid pixel-level losses entirely. One possibility is to combine the FaceNet loss of Sec. 3.3 with an adversarial loss [20], which would allow training on large collections of images that are not front-facing and neutral-expression.

References

[1] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.
[2] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., 1999, pp. 187-194.
[3] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, "The MegaFace benchmark: 1 million faces for recognition at scale," CoRR, vol. abs/1512.00596, 2015. [Online]. Available: http://arxiv.org/abs/1512.00596
[4] T. Hassner, S. Harel, E. Paz, and R. Enbar, "Effective face frontalization in unconstrained images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4295-4304.
[5] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701-1708.
[6] A. Lanitis, C. J. Taylor, and T. F. Cootes, "A unified approach to coding and interpreting face images," in Computer Vision, 1995. Proceedings., Fifth International Conference on. IEEE, 1995, pp. 368-373.
[7] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, "Visualizing higher-layer features of a deep network," University of Montreal, Tech. Rep. 1341, Jun. 2009. Also presented at the ICML 2009 Workshop on Learning Feature Hierarchies, Montréal, Canada.
[8] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," CoRR, vol. abs/1312.6034, 2013. [Online]. Available: http://arxiv.org/abs/1312.6034
[9] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," CoRR, vol. abs/1412.0035, 2014. [Online]. Available: http://arxiv.org/abs/1412.0035
[10] J. Yosinski, J. Clune, A. M. Nguyen, T. Fuchs, and H. Lipson, "Understanding neural networks through deep visualization," CoRR, vol. abs/1506.06579, 2015. [Online]. Available: http://arxiv.org/abs/1506.06579
[11] A. Mordvintsev, C. Olah, and M. Tyka. (2015, Jun.) Inceptionism: Going deeper into neural networks.
[12] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," CoRR, vol. abs/1311.2901, 2013. [Online]. Available: http://arxiv.org/abs/1311.2901
[13] A. Dosovitskiy and T. Brox, "Inverting visual representations with convolutional networks," arXiv preprint arXiv:1506.02753, 2015.
[14] A. Zhmoginov and M. Sandler, "Inverting face embeddings with convolutional neural networks," arXiv preprint arXiv:1606.04189, 2016.
[15] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in IEEE Transactions on Pattern Analysis and Machine Intelligence. Springer, 1998, pp. 484-498.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[17] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
[18] A. Dosovitskiy and T. Brox, "Generating images with perceptual similarity metrics based on deep networks," CoRR, vol. abs/1602.02644, 2016. [Online]. Available: http://arxiv.org/abs/1602.02644
[19] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision, 2016.
[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680.
[21] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," International Conference on Learning Representations, 2016.
[22] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," arXiv preprint arXiv:1505.07818, 2015.
[23] M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," International Conference on Learning Representations, 2016.