Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis

Jimei Yang1,3   Scott Reed2   Ming-Hsuan Yang1   Honglak Lee2
1 University of California, Merced   {jyang44, mhyang}@ucmerced.edu
2 University of Michigan, Ann Arbor   {reedscot, honglak}@umich.edu
3 Adobe Research   {jimyang}@adobe.com

arXiv:1601.00706v1 [cs.LG] 5 Jan 2016

Abstract

An important problem for both graphics and vision is to synthesize novel views of a 3D object from a single image. This is particularly challenging due to the partial observability inherent in projecting a 3D object onto the image space, and the ill-posedness of inferring object shape and pose. However, we can train a neural network to address the problem if we restrict our attention to specific object categories (in our case faces and chairs) for which we can gather ample training data. In this paper, we propose a novel recurrent convolutional encoder-decoder network that is trained end-to-end on the task of rendering rotated objects starting from a single image. The recurrent structure allows our model to capture long-term dependencies along a sequence of transformations. We demonstrate the quality of its predictions for human faces on the Multi-PIE dataset and for a dataset of 3D chair models, and also show its ability to disentangle latent factors of variation (e.g., identity and pose) without using full supervision.

1 Introduction

Numerous graphics algorithms have been established to synthesize photorealistic images from 3D models and environmental variables (lighting and viewpoints), commonly known as rendering. At the same time, recent advances in vision algorithms enable computers to gain some form of understanding of objects contained in images, such as classification [17], detection [11], segmentation [19], and caption generation [28], to name a few. These approaches typically aim to deduce abstract representations from raw image pixels. However, it has been a long-standing problem for both graphics and vision to automatically synthesize novel images by applying intrinsic transformations (e.g., 3D rotation and deformation) to the subject of an input image. From an artificial intelligence perspective, this can be viewed as answering questions about object appearance when the view angle or illumination is changed, or some action is taken. These synthesized images may then be perceived by humans in photo editing [15], or evaluated by other machine vision systems, such as game-playing agents with vision-based reinforcement learning [21, 22].

In this paper, we consider the problem of predicting the transformed appearance of an object when it is rotated in 3D from a single image. In general, this is an ill-posed problem due to the loss of information inherent in projecting a 3D object into the image space. Classic geometry-based approaches either recover a 3D object model from multiple related images, i.e., multi-view stereo and structure-from-motion, or register a single image of a known object category to its prior 3D model, e.g., faces [5]. The resulting mesh can be used to re-render the scene from novel viewpoints. However, having 3D meshes as intermediate representations, these methods are 1) limited to particular object categories, 2) vulnerable to image alignment mistakes, and 3) prone to generating artifacts when synthesizing unseen textures. To overcome these limitations, we propose a learning-based approach without explicit 3D model recovery.
Having observed rotations of similar 3D objects (e.g., faces, chairs, household objects), the trained model can both 1) better infer the true pose, shape and texture of the object, and 2) make plausible assumptions about potentially ambiguous aspects of appearance in novel viewpoints. Thus, the learning algorithm relies on mappings between the Euclidean image space and an underlying nonlinear manifold. In particular, 3D view synthesis can be cast as pose manifold traversal, where a desired rotation can be decomposed into a sequence of small steps. A major challenge arises due to the long-term dependency among multiple rotation steps; the key identifying information (e.g., shape, texture) from the original input must be remembered along the entire trajectory. Furthermore, the local rotation at each step must generate the correct result on the data manifold, or subsequent steps will also fail.

Closely related to the image generation task considered in this paper is the problem of 3D invariant recognition, which involves comparing object images from different viewpoints or poses with dramatic changes of appearance. Shepard and Metzler, in their mental rotation experiments [24], found that the time taken for humans to match 3D objects from two different views increased proportionally with the angular rotational difference between them. It was as if the humans were rotating their mental images at a steady rate. Inspired by this mental rotation phenomenon, we propose a recurrent convolutional encoder-decoder network with action units to model the process of pose manifold traversal. The network consists of four components: a deep convolutional encoder [17], shared identity units, recurrent pose units with rotation action inputs, and a deep convolutional decoder [8]. Rather than training the network to model a specific rotation sequence, we provide control signals at each time step instructing the model how to move locally along the pose manifold. The rotation sequences can be of varying length. To improve the ease of training, we employed curriculum learning, similar to that used in other sequence prediction problems [29]. Intuitively, the model should learn how to make a one-step 15° rotation before learning how to make a series of such rotations.

The main contributions of this work are summarized as follows. First, a novel recurrent convolutional encoder-decoder network is developed for learning to apply out-of-plane rotations to human faces and 3D chair models. Second, the learned model can generate realistic rotation trajectories with a control signal supplied at each step by the user. Third, despite only being trained to synthesize images, our model learns discriminative view-invariant features without using class labels. This weakly-supervised disentangling is especially notable with longer-term prediction.

2 Related Work

The transforming autoencoder [13] introduces the notion of capsules in deep networks, which track both the presence and position of visual features in the input image. These models can apply affine transformations and 3D rotations to images. We address a similar task of rendering object appearance undergoing 3D rotations, but we use a convolutional network architecture in lieu of capsules, and incorporate action inputs and recurrent structure to handle repeated rotation steps. The Predictive Gating Pyramid [20] is developed for time-series prediction and can learn image transformations including shifts and rotations over multiple time steps. Our task is related to this time-series prediction, but our formulation includes a control signal, uses disentangled latent features, and uses convolutional encoder and decoder networks to model detailed images. Ding and Taylor [7] proposed a gating network to directly model mental rotation by optimizing transforming distance.
Instead of extracting invariant recognition features in one shot, their model learns to perform recognition by exploring a space of relevant transformations. Similarly, our model can explore the space of rotations about an object image by setting the control signal at each time step of our recurrent network.

The problem of training neural networks that generate images is studied in [27]. Dosovitskiy et al. [8] proposed a convolutional network mapping shape, pose and transformation labels to images for generating chairs. It is able to control these factors of variation and generate high-quality renderings. We also generate chair renderings in this paper, but our model adds several additional features: a deep encoder network (so that we can generalize to novel images, rather than only decode), distributed representations for appearance and pose, and recurrent structure for long-term prediction. Contemporary to our work, the Inverse Graphics Network (IGN) [18] also adds an encoding function to learn graphics codes of images, along with a decoder similar to that in the chair-generating network. As in our model, IGN uses a deep convolutional encoder to extract image representations, applies modifications to these, and then re-renders. Our model differs in that 1) we train a recurrent network to perform trajectories of multiple transformations, 2) we add a control signal input at each step, and 3) we use deterministic feed-forward training rather than the variational auto-encoder (VAE) framework [16] (although our approach could be extended to a VAE version).

Figure 1: Deep convolutional encoder-decoder network for learning 3D rotation.

A related line of work to ours is disentangling the latent factors of variation that generate natural images. Bilinear models for separating style and content are developed in [26], and are shown to be capable of separating handwriting style and character identity, and also separating face identity and pose. The disentangling Boltzmann Machine (disBM) [23] applies this idea to augment the Restricted Boltzmann Machine by partitioning its hidden state into distinct factors of variation and modeling their higher-order interaction. The multi-view perceptron [31] employs a stochastic feedforward network to disentangle the identity and pose factors of face images in order to achieve view-invariant recognition. The encoder network for IGN is also trained to learn a disentangled representation of images by extracting a graphics code for each factor. In [6], the (potentially unknown) latent factors of variation are both discovered and disentangled using a novel hidden unit regularizer. Our work is also loosely related to the "DeepStereo" algorithm [10] that synthesizes novel views of scenes from multiple images using deep convolutional networks.

3 Recurrent Convolutional Encoder-Decoder Network

In this section we describe our model formulation. Given an image of a 3D object, our goal is to synthesize its rotated views. Inspired by the recent success of convolutional networks (CNNs) in mapping images to high-level abstract representations [17] and synthesizing images from graphics codes [8], we base our model on deep convolutional encoder-decoder networks. One example network structure is shown in Figure 1. The encoder network uses 5×5 convolution-relu layers with stride 2 and 2-pixel padding so that the dimension is halved at each convolution layer, followed by two fully-connected layers. In the bottleneck layer, we define a group of units to represent the pose (pose units) to which the desired transformations can be applied. The other group of units represents what does not change during transformations, named identity units. The decoder network is symmetric to the encoder. To increase dimensionality we use fixed upsampling as in [8].
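The figure itself is not reproduced here. As a rough illustration of the layout just described, and of the action-conditioned pose transform described in the following paragraphs, the PyTorch-style sketch below is our own re-implementation outline rather than the authors' Caffe model: layer sizes follow the Multi-PIE configuration reported later in Section 4.2 (two 5×5 stride-2 conv-relu layers, one fully-connected layer, 512 identity units, 128 pose units), while the output nonlinearity and parameter initialization are assumptions.

```python
# Minimal sketch of the encoder / bottleneck / decoder layout (our outline,
# not the authors' Caffe implementation). Sizes assume 80x60x3 Multi-PIE crops.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoder(nn.Module):
    def __init__(self, n_id=512, n_pose=128, n_act=3):
        super().__init__()
        # Encoder: 5x5 conv-relu with stride 2 and 2-pixel padding halves each spatial dim.
        self.conv1 = nn.Conv2d(3, 64, 5, stride=2, padding=2)
        self.conv2 = nn.Conv2d(64, 128, 5, stride=2, padding=2)
        self.fc_enc = nn.Linear(128 * 20 * 15, 1024)   # 80x60 input -> 20x15 feature maps
        self.fc_id = nn.Linear(1024, n_id)             # identity units
        self.fc_pose = nn.Linear(1024, n_pose)         # pose units
        # Action-conditioned transform: each one-hot action selects a matrix that
        # maps input pose units to output pose units (a 3-way tensor product).
        self.T = nn.Parameter(0.01 * torch.randn(n_act, n_pose, n_pose))
        # Decoder: mirror of the encoder with fixed (non-learned) upsampling.
        self.fc_dec = nn.Linear(n_id + n_pose, 128 * 20 * 15)
        self.deconv1 = nn.Conv2d(128, 64, 5, padding=2)
        self.deconv2 = nn.Conv2d(64, 3, 5, padding=2)

    def encode(self, x):
        h = F.relu(self.conv2(F.relu(self.conv1(x))))
        h = F.relu(self.fc_enc(h.flatten(1)))
        return self.fc_id(h), self.fc_pose(h)          # identity units, pose units

    def transform(self, pose, action):
        # action: (B, 3) one-hot [clockwise, no-op, counter-clockwise]
        W = torch.einsum('ba,apq->bpq', action, self.T)      # per-example pose matrix
        return torch.bmm(pose.unsqueeze(1), W).squeeze(1)    # transformed pose units

    def decode(self, identity, pose):
        h = F.relu(self.fc_dec(torch.cat([identity, pose], dim=1)))
        h = h.view(-1, 128, 20, 15)
        h = F.relu(self.deconv1(F.interpolate(h, scale_factor=2)))  # fixed upsampling
        return torch.sigmoid(self.deconv2(F.interpolate(h, scale_factor=2)))
```

The `transform` method realizes the "action selects a matrix" reading of the tensor product: each one-hot control signal picks one of three learned pose-to-pose matrices.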
We found that fixed stride-2 convolution and upsampling worked better than max-pooling and unpooling with switches, because when applying transformations the encoder pooling switches would not in general match the switches produced by the target image. The desired transformations are reflected by the action units. We used a 1-of-3 encoding, in which [100] encoded a clockwise rotation, [010] encoded a no-op, and [001] encoded a counter-clockwise rotation. The triangle indicates a tensor product taking as input the pose units and action units, and producing the transformed pose units. Equivalently, the action unit selects the matrix that transforms the input pose units to the output pose units.

The action units introduce a small linear increment to the pose units, which essentially models local transformations on the nonlinear pose manifold. However, in order to achieve longer rotation trajectories, if we simply accumulate the linear increments from the action units (e.g., [2 0 0] for a two-step clockwise rotation), the pose units will fall off the manifold, resulting in bad predictions. To overcome this problem, we generalize the model to a recurrent neural network; recurrent networks have been shown to capture long-term dependencies for a wide variety of sequence modeling problems. In essence, we use recurrent pose units to model the step-by-step pose manifold traversals. The identity units are shared across all time steps since we assume that all training sequences preserve the identity while only changing the pose. Figure 2 shows the unrolled version of our RNN model. We only perform encoding at the first time step, and all transformations are carried out in the latent space; i.e., the model predictions at time step t are not fed into the next time step's input. The training objective is based on pixel-wise prediction over all time steps of the training sequences:

$$ L_{rnn} = \sum_{i=1}^{N}\sum_{t=1}^{T} \left\| y^{(i,t)} - g\big(f_{pose}(x^{(i)}, a^{(i)}, t),\ f_{id}(x^{(i)})\big) \right\|_2^2 \qquad (1) $$

where $a^{(i)}$ is the sequence of T actions, $f_{id}(x^{(i)})$ produces the identity features invariant to all the time steps, $f_{pose}(x^{(i)}, a^{(i)}, t)$ produces the transformed pose features at time step t, $g(\cdot,\cdot)$ is the image decoder producing an image given the output of $f_{id}(\cdot)$ and $f_{pose}(\cdot,\cdot,\cdot)$, $x^{(i)}$ is the i-th input image, and $y^{(i,t)}$ is the i-th training image target at step t.

Figure 2: Unrolled recurrent convolutional network for learning to rotate 3D objects. The convolutional encoder and decoder have been abstracted out, represented here as vertical rectangles.

3.1 Curriculum Training

We trained the network parameters using backpropagation through time and the ADAM optimization method [3]. To effectively train our recurrent network, we found it beneficial to use curriculum learning [4], in which we gradually increase the difficulty of training by increasing the trajectory length. This appears to be useful for sequence prediction with recurrent networks in other domains as well [22, 29]. In Section 4, we show that increasing the training sequence length improves both the model's image prediction performance and the pose-invariant recognition performance of the identity features.

Also, longer training sequences force the identity units to better disentangle themselves from the pose. If the same identity units need to be used to predict both a 15°-rotated and a 120°-rotated image during training, these units cannot pick up pose-related information. In this way, our model can learn disentangled features (i.e., identity units can do invariant identity recognition but are not informative of pose, and vice versa) without explicitly regularizing to achieve this effect. We did not find it necessary to use gradient clipping.
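As a concrete reading of Eq. (1), the sketch below (reusing the hypothetical `EncoderDecoder` module from the earlier sketch) accumulates the pixel-wise squared error over all steps of a training sequence; the image is encoded only once, and every subsequent step transforms the recurrent pose units in latent space.

```python
import torch

def sequence_loss(model, x, actions, targets):
    """Pixel-wise objective of Eq. (1) for one mini-batch of training sequences.

    x:        (B, 3, H, W)     input images at the first time step
    actions:  (B, T, 3)        one-hot control signals for each step
    targets:  (B, T, 3, H, W)  ground-truth rotated views
    """
    identity, pose = model.encode(x)              # encode only once (first step)
    loss = x.new_zeros(())
    for t in range(actions.shape[1]):
        pose = model.transform(pose, actions[:, t])   # traverse the pose manifold
        y_hat = model.decode(identity, pose)          # re-render at step t
        loss = loss + ((y_hat - targets[:, t]) ** 2).sum()
    return loss
```

Under curriculum training, the same function would simply be called with longer action/target sequences at each stage, with parameters initialized from the previous stage.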
4 Experiments

We carry out experiments to achieve the following objectives. First, we examine the ability of our model to synthesize high-quality images of both faces and complex 3D objects (chairs) over a wide range of rotation angles. Second, we evaluate the discriminative performance of disentangled identity units through cross-view object recognition. Third, we demonstrate the ability to generate and rotate novel object classes by interpolating identity units of query objects.

4.1 Datasets

Multi-PIE. The Multi-PIE [12] dataset consists of 754,204 face images from 337 people. The images are captured from 15 viewpoints under 20 illumination conditions in different sessions. To evaluate our model for rotating faces, we select a subset of Multi-PIE that covers 7 viewpoints evenly from −45° to 45° under neutral illumination. Each face image is aligned through manually annotated landmarks on the eyes, nose and mouth corners, and then cropped to 80×60×3 pixels. We use the images of the first 200 people for training and the remaining 137 people for testing.

Chairs. This dataset contains 1393 chair CAD models made publicly available by Aubry et al. [2]. Each chair model is rendered from 31 azimuth angles (with steps of 11° or 12°) and 2 elevation angles (20° and 30°) at a fixed distance to the virtual camera. We use a subset of 809 chair models in our experiments, selected out of the 1393 by Dosovitskiy et al. [8] in order to remove near-duplicate models (e.g., models differing only in color) or low-quality models. We crop the rendered images to have a small border and resize them to a common size of 64×64×3 pixels. We also prepare their binary masks by subtracting the white background. We use the images of the first 500 models as the training set and the remaining 309 models as the test set.

Figure 3: 3D view synthesis on Multi-PIE. For each panel, the first row shows the ground truth from −45° to 45°; the second and third rows show the re-renderings of 6-step clockwise rotation from an input image at −45° (red box) and of 6-step counter-clockwise rotation from an input image at 45° (red box), respectively.

Figure 4: Comparing face pose normalization results with the 3D morphable model [30] (rows: input, RNN, 3D model).

4.2 Network Architectures and Training Details

Multi-PIE. The encoder network for the Multi-PIE dataset used two convolution-relu layers with stride 2 and 2-pixel padding, followed by one fully-connected layer: 5×5×64 - 5×5×128 - 1024. The numbers of identity and pose units are 512 and 128, respectively. The decoder network is symmetric to the encoder. The curriculum training procedure starts with the single-step rotation model, which we call RNN1.

We prepare the training samples by pairing face images of the same person captured in the same session with adjacent camera viewpoints. For example, x^(i) at −30° is mapped to y^(i) at −15° with action a^(i) = [001]; x^(i) at −15° is mapped to y^(i) at −30° with action a^(i) = [100]; and x^(i) at −30° is mapped to y^(i) at −30° with action a^(i) = [010]. For face images at the ending viewpoints −45° and 45°, only one-way rotation is feasible. We train the network using the ADAM optimizer with fixed learning rate 10^-4 for 400 epochs.¹

Since there are 7 viewpoints per person per session, we schedule the curriculum training with t = 2, t = 4 and t = 6 stages, which we call RNN2, RNN4 and RNN6, respectively. To sample training sequences with fixed length, we allow both clockwise and counter-clockwise rotations. For example, when t = 4, one input image x^(i) at 30° is mapped to (y^(i,1), y^(i,2), y^(i,3), y^(i,4)) with corresponding angles (45°, 30°, 15°, 0°) and action inputs ([001], [100], [100], [100]). In each stage, we initialize the network parameters with those of the previous stage and fine-tune the network with fixed learning rate 10^-5 for 10 additional epochs.
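For illustration, here is a small sketch of how fixed-length Multi-PIE training sequences could be enumerated with the 1-of-3 action encoding from Section 3. This is our own simplification (single-direction sequences only, whereas the paper also allows sequences that reverse direction at the boundary viewpoints), and the sampling code is hypothetical rather than the authors' data pipeline.

```python
# Illustrative sampling of a t-step Multi-PIE training sequence.
# Only the 15-degree viewpoint grid from -45 to 45 and the 1-of-3 action
# encoding follow the paper; the bookkeeping below is our own simplification.
import random

VIEWS = [-45, -30, -15, 0, 15, 30, 45]            # 7 viewpoints, 15-degree steps
CW, NOOP, CCW = [1, 0, 0], [0, 1, 0], [0, 0, 1]   # 1-of-3 action encoding

def sample_sequence(t):
    """Return (start_view, target_views, actions) for a t-step rotation (t <= 6)."""
    direction = random.choice([+1, -1])            # +1: toward +45, -1: toward -45
    # Pick a start index so that t steps stay inside the 7-view range.
    lo, hi = (0, len(VIEWS) - 1 - t) if direction > 0 else (t, len(VIEWS) - 1)
    start = random.randint(lo, hi)
    views, actions, idx = [], [], start
    for _ in range(t):
        idx += direction
        views.append(VIEWS[idx])
        # Rotation toward +45 uses [001], toward -45 uses [100], as in the text.
        actions.append(CCW if direction > 0 else CW)
    return VIEWS[start], views, actions
```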
Chairs. The encoder network for chairs used three convolution-relu layers with stride 2 and 2-pixel padding, followed by two fully-connected layers: 5×5×64 - 5×5×128 - 5×5×256 - 1024 - 1024. The decoder network is symmetric, except that after the fully-connected layers it branches into image and mask prediction layers. The mask prediction indicates whether a pixel belongs to the foreground or background. We adopted this idea from the generative CNN [8] and found it beneficial to training efficiency and image synthesis quality. A tradeoff parameter λ = 0.1 is applied to the mask prediction loss. We train the single-step network parameters with fixed learning rate 10^-4 for 500 epochs. We schedule the curriculum training with t = 2, t = 4, t = 8 and t = 16, which we call RNN2, RNN4, RNN8 and RNN16. Note that the curriculum training stops at t = 16 because we reached the limit of GPU memory. Since the images of each chair model are rendered from 31 viewpoints evenly sampled between 0° and 360°, we can easily prepare training sequences of clockwise or counter-clockwise t-step rotations around the circle. Similarly, the network parameters of the current stage are initialized with those of the previous stage and fine-tuned with the learning rate 10^-5 for 50 epochs.

¹ We carry out experiments using Caffe [14] on Nvidia K40c and Titan X GPUs.

4.3 3D View Synthesis of Novel Objects

We first examine the re-rendering quality of our RNN models for novel object instances that were not seen during training. On the Multi-PIE dataset, given one input image from the test set with a possible view between −45° and 45°, the encoder produces identity units and pose units, and the decoder then renders images progressively with fixed identity units and action-driven recurrent pose units for up to t steps. Examples of the longest rotations are shown in Figure 3, i.e., clockwise from −45° to 45° and counter-clockwise from 45° to −45° with RNN6. High-quality renderings are generated with smooth transformations between adjacent views. The characteristics of the faces, such as gender, expression, eyes, nose and glasses, are also preserved during rotation. We also compare our RNN model with a state-of-the-art 3D morphable model for face pose normalization [30] in Figure 4. It can be observed that our RNN model produces stable renderings while the 3D morphable model is sensitive to facial landmark localization. One advantage of the 3D morphable model is that it preserves facial textures well.

On the chair dataset, we use RNN16 to synthesize 16 rotated views of novel chairs in the test set. Given a chair image of a certain view, we define two action sequences: one for progressive clockwise rotation and another for counter-clockwise rotation. This is a more challenging task compared to rotating faces due to the complex 3D shapes of chairs and the large rotation angles (more than 180° after 16-step rotations). Since no previous methods tackle the exact same chair re-rendering problem, we use a k-nearest-neighbor (KNN) method for baseline comparisons. The KNN baseline is implemented as follows. We first extract the "fc7" CNN features from the VGG-16 net [25] for all the chair images. For each test chair image, we find its K nearest neighbors in the training set by comparing their "fc7" features. The retrieved top-K images are expected to be similar to the query in terms of both style and pose [1]. Given a desired rotation angle, we synthesize rotated views of the test image by averaging the corresponding rotated views of the retrieved top-K images in the training set at the pixel level. We tune the K value in [1, 3, 5, 7], namely KNN1, KNN3, KNN5 and KNN7, to achieve the best performance; a sketch of this baseline is given below. Two examples are shown in Figure 5.
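A sketch of this KNN baseline, assuming pre-computed VGG-16 "fc7" features and per-image arrays of rotated renderings; the array names and the choice of L2 distance in feature space are our assumptions, since the paper does not spell out the metric.

```python
import numpy as np

def knn_rotated_view(query_fc7, train_fc7, train_views, k=5, rot_offset=1):
    """KNN baseline: average the rotated renderings of the K training images
    whose VGG-16 'fc7' features are closest to the query.

    query_fc7:   (4096,)             fc7 feature of the test chair image
    train_fc7:   (N, 4096)           fc7 features of the training chair images
    train_views: (N, 31, H, W, 3)    renderings of the chair in training image i,
                                     indexed by rotation offset relative to that
                                     image's own view (an assumed data layout)
    rot_offset:  desired rotation, in units of the 31-view azimuth grid
    """
    # Distance in fc7 space (metric assumed; the paper only says features are compared).
    dists = np.linalg.norm(train_fc7 - query_fc7, axis=1)
    nearest = np.argsort(dists)[:k]                     # indices of the top-K neighbors
    # Pixel-level average of the neighbors' renderings at the desired rotation.
    return train_views[nearest, rot_offset].mean(axis=0)
```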
In our RNN model, the 3D shapes are well preserved with clear boundaries for all 16 rotated views from different inputs, and the appearance changes smoothly between adjacent views with a consistent style.

Figure 5: 3D view synthesis of 16-step rotations on Chairs. In each panel, we compare synthesis results of the RNN16 model (top) and of the KNN5 baseline (bottom). The first two panels belong to the same chair with different starting views, while the last two panels are from another chair with two starting views. Input images are marked with red boxes.

Note that, conceptually, the network parameters learned during different stages of curriculum training can be used to process an arbitrary number of rotation steps. The RNN1 model (the first row in Figure 6) works well in the first rotation step, but it produces degenerate results from the second step. RNN2 (the second row), trained with two-step rotations, generates reasonable results in the third step. Progressively, RNN4 and RNN8 seem to generalize well on chairs with longer predictions (t = 6 for RNN4 and t = 12 for RNN8). We measure the quantitative performance of KNN and our RNN by the mean squared error (MSE) in (1), shown in Figure 7. As a result, the best KNN with 5 retrievals (KNN5) obtains ~310 MSE, which is comparable to our RNN4 model, but our RNN16 model significantly outperforms KNN5 (~179 MSE), a 42% relative improvement.

Figure 6: Comparing chair synthesis results from RNN at different curriculum stages.

Figure 7: Comparing reconstruction mean squared errors (MSE) on chairs with RNNs and KNNs (y-axis: reconstruction error; x-axis: number of rotation steps).

4.4 Cross-View Object Recognition

In this experiment, we examine and compare the discriminative performance of disentangled representations through cross-view object recognition.

Multi-PIE. We create 7 gallery/probe splits from the test set. In each split, the face images of the same view, e.g., −45°, are collected as the gallery and those of the remaining views serve as probes. We extract 512-d features from the identity units of the RNNs for all the test images, so that the probes are matched to the gallery by their cosine distance. It is counted as a success if the matched gallery image has the same identity as the probe. We also categorize the probes in each split by measuring their angle offsets from the gallery. In particular, the angle offsets range from 15° to 90°. The recognition difficulty increases with the angle offset. To demonstrate the discriminative performance of our learned representations, we also implement a convolutional network classifier. The CNN architecture is set up by connecting our encoder and identity units with a 200-way softmax output layer, and its parameters are learned on the training set with ground-truth class labels. The 512-d features extracted from the layer before the softmax layer are used to perform cross-view object recognition as above. Figure 8 (left) compares the average success rates of the RNNs and CNN with their standard deviations over the 7 splits for each angle offset. The success rates of RNN1 drop more than 20% from angle offset 15° to 90°. The success rates generally keep improving with curriculum training of the RNNs, and the best results are achieved with RNN6. As expected, the performance gap for RNN6 between 15° and 90° reduces to 10%. This phenomenon demonstrates that our RNN model gradually learns pose/viewpoint-invariant representations for 3D face recognition. Without using any class labels, our RNN model achieves competitive results against the CNN.
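The matching procedure in these cross-view experiments amounts to nearest-neighbor search under cosine distance in the 512-d identity-unit space; a compact sketch with hypothetical array names is given below.

```python
import numpy as np

def cross_view_accuracy(gallery_feat, gallery_ids, probe_feat, probe_ids):
    """Match each probe to the gallery by cosine distance of identity units.

    gallery_feat: (G, 512) identity-unit features of the gallery images
    probe_feat:   (P, 512) identity-unit features of the probe images
    gallery_ids, probe_ids: integer identity labels (numpy arrays)
    """
    g = gallery_feat / np.linalg.norm(gallery_feat, axis=1, keepdims=True)
    p = probe_feat / np.linalg.norm(probe_feat, axis=1, keepdims=True)
    sim = p @ g.T                                  # cosine similarity, shape (P, G)
    match = gallery_ids[np.argmax(sim, axis=1)]    # best gallery match per probe
    return float(np.mean(match == probe_ids))      # success rate
```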
Chairs. The experimental setup is similar to Multi-PIE. There are in total 31 azimuth views per chair instance. For each view, we create its gallery/probe split, so we have 31 splits. We extract 512-d features from the identity units of RNN1, RNN2, RNN4, RNN8 and RNN16. The probes for each split are sorted by their angle offsets from the gallery images. Note that this experiment is particularly challenging because chair matching is a fine-grained recognition task and chair appearances change significantly with 3D rotations. We also compare our model against a CNN, but instead of training a CNN from scratch we use the pre-trained VGG-16 net [25] to extract the 4096-d "fc7" features for chair matching. The success rates are shown in Figure 8 (right). The performance drops quickly when the angle offset is greater than 45°, but RNN16 significantly improves the overall success rates, especially for large angle offsets. We notice that the standard deviations are large around angle offsets of 70° to 120°. This is because some views contain more information about the chairs' 3D shapes than other views, so we see performance variations. Interestingly, the performance of the VGG-16 net surpasses our RNN model when the angle offset is greater than 120°. We hypothesize that this phenomenon results from the symmetric structure of most of the chairs. The VGG-16 net was trained with mirroring data augmentation to achieve a certain symmetric invariance, while our RNN model does not exploit this structure.

To further demonstrate the disentangling property of our RNN model, we use the pose units extracted from the input images to repeat the above cross-view recognition experiments. The mean success rates are shown in Table 1. It turns out that the better the identity units perform, the worse the pose units perform. When the identity units achieve near-perfect recognition on Multi-PIE, the pose units only obtain a mean success rate of 1.4%, which is close to the random guess of 0.5% for 200 classes.

Figure 8: Comparing cross-view recognition success rates for faces (left) and chairs (right) (y-axis: classification success rate (%); x-axis: angle offset).

Models       RNN: identity   RNN: pose   CNN
Multi-PIE    93.3            1.4         92.6
Chairs       56.8            9.0         52.5

Table 1: Comparing mean cross-view recognition success rates (%) with identity and pose units.

4.5 Class Interpolation and View Synthesis

In this experiment, we demonstrate the ability of our RNN model to generate novel chairs by interpolating between two existing ones. Given two chair images of the same view from different instances, the encoder network is used to compute their identity units $z^1_{id}, z^2_{id}$ and pose units $z^1_{pose}, z^2_{pose}$, respectively. The interpolation is computed by $z_{id} = \beta z^1_{id} + (1-\beta) z^2_{id}$ and $z_{pose} = \beta z^1_{pose} + (1-\beta) z^2_{pose}$, where $\beta = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]$. The interpolated $z_{id}$ and $z_{pose}$ are then fed into the recurrent decoder network to render the rotated views. Example interpolations between four chair instances are shown in Figure 9. The interpolated chairs present smooth stylistic transformations between any pair of input classes (each row in Figure 9), and their unique stylistic characteristics are also well preserved among the rotated views (each column in Figure 9).
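The interpolation is a convex combination applied separately to the two latent groups; the sketch below (again reusing the hypothetical `EncoderDecoder` from the earlier sketches) blends two encoded chairs and rolls out a rotation trajectory from the blended code.

```python
import torch

def interpolate_and_rotate(model, x1, x2, actions, beta=0.5):
    """Blend the identity and pose units of two chairs and roll out rotations.

    x1, x2:  (1, 3, H, W) two chair images of the same view
    actions: (1, T, 3)    control signals for the rotation trajectory
    """
    id1, pose1 = model.encode(x1)
    id2, pose2 = model.encode(x2)
    identity = beta * id1 + (1 - beta) * id2      # z_id   = beta*z1_id   + (1-beta)*z2_id
    pose = beta * pose1 + (1 - beta) * pose2      # z_pose = beta*z1_pose + (1-beta)*z2_pose
    frames = [model.decode(identity, pose)]       # rendering of the interpolated input view
    for t in range(actions.shape[1]):
        pose = model.transform(pose, actions[:, t])
        frames.append(model.decode(identity, pose))
    return torch.stack(frames, dim=1)             # (1, T+1, 3, H, W)
```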
Figure 9: Chair style interpolation and view synthesis. Given four chair images of the same view (first row) from the test set, each row presents renderings of style manifold traversal with a fixed view, while each column presents renderings of pose manifold traversal with a fixed interpolated identity.

5 Conclusion

In this paper, we develop a recurrent convolutional encoder-decoder network and demonstrate its effectiveness for synthesizing 3D views of unseen object instances. On the Multi-PIE dataset and a database of 3D chair CAD models, the model predicts accurate renderings across trajectories of repeated rotations. The proposed curriculum training, which gradually increases the trajectory length of the training sequences, yields both better image appearance and more discriminative features for pose-invariant recognition. We also show that a trained model can interpolate across the identity manifold of chairs at a fixed pose, and traverse the pose manifold while fixing the identity. This generative disentangling of chair identity and pose emerged from our recurrent rotation prediction objective, even though we do not explicitly regularize the hidden units to be disentangled. Our future work includes introducing actions other than rotation into the proposed model, handling objects embedded in complex scenes, and handling one-to-many mappings for which a transformation yields a multi-modal distribution over future states in the trajectory.

Acknowledgments

This work was supported in part by ONR N00014-13-1-0762, NSF CAREER IIS-1453651, and NSF CMMI-1266184. We thank NVIDIA for donating a Tesla K40 GPU.

References

[1] M. Aubry and B. C. Russell. Understanding deep features with computer-generated imagery. In ICCV, 2015.
[2] M. Aubry, D. Maturana, A. A. Efros, B. Russell, and J. Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.
[3] J. Ba and D. Kingma. Adam: A method for stochastic optimization. In ICLR, 2015.
[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[5] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999.
[6] B. Cheung, J. Livezey, A. Bansal, and B. Olshausen. Discovering hidden factors of variation in deep networks. In ICLR, 2015.
[7] W. Ding and G. Taylor. Mental rotation by optimizing transforming distance. In NIPS Deep Learning and Representation Learning Workshop, 2014.
[8] A. Dosovitskiy, J. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
[9] S. Fidler, S. Dickinson, and R. Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NIPS, 2012.
[10] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. DeepStereo: Learning to predict new views from the world's imagery. arXiv preprint arXiv:1506.06825, 2015.
[11] R. Girshick. Fast R-CNN. In ICCV, 2015.
[12] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807-813, May 2010.
[13] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[15] N. Kholgade, T. Simon, A. Efros, and Y. Sheikh. 3D object manipulation in a single photograph using stock 3D models. In SIGGRAPH, 2014.
[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[18] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[20] V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent "grammar cells". In NIPS, 2014.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
[22] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.
[23] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.
[24] R. N. Shepard and J. Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701-703, 1971.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[26] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247-1283, 2000.
[27] T. Tieleman. Optimizing neural networks that generate images. PhD thesis, University of Toronto, 2014.
[28] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[29] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
[30] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.
[31] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In NIPS, 2014.

Appendix

5.1 Training the chair model with mask stream

Figure 10: Deep convolutional encoder-decoder network for training the chair model with mask stream.

For training the chair model, in addition to decoding the rotated chair images, we also decode their binary masks. The layer structure is illustrated in Figure 10. Being an easier prediction objective than the image, the binary mask provides a proper regularization to the network and significantly improves the prediction performance. We compare the learning curves with and without the mask stream for training RNN1 in Figure 11. It is notable that the training loss for the network with the mask stream decreases faster and its testing loss tends to converge better.

Figure 11: Comparing learning curves with and without mask stream for training the RNN1 model on the Chairs dataset.

5.2 Results on cars

We also evaluate our model on synthesizing 3D views of cars from a single image. We use the car CAD models collected by [9]. For each of the 183 CAD models, we generate 64×64 grayscale renderings from 24 azimuth angles, each offset by 15 degrees, and 4 elevation angles [0, 6, 12, 18]. The renderings from the first 150 models are