Binge Watching: Scaling Affordance Learning from Sitcoms XiaolongWang RohitGirdhar AbhinavGupta ⇤ ⇤ TheRoboticsInstitute,CarnegieMellonUniversity http://www.cs.cmu.edu/˜xiaolonw/affordance.html Abstract Query Image Query Image In recent years, there has been a renewed interest in jointlymodelingperceptionandaction. Atthecoreofthis investigationistheideaofmodelingaffordances1. However, whenitcomestopredictingaffordances, eventhestateof the art approaches still do not use any ConvNets. Why is that? Unlikesemanticor3Dtasks,therestilldoesnotexist Retrievals Retrievals anylarge-scaledatasetforaffordances. Inthispaper, we tacklethechallengeofcreatingoneofthebiggestdataset forlearningaffordances. Weusesevensitcomstoextracta diversesetofscenesandhowactorsinteractwithdifferent objectsinthescenes. Ourdatasetconsistsofmorethan10K scenesand28Kwayshumanscaninteractwiththese10K images. We also propose a two-step approach to predict affordancesinanewscene. Inthefirststep,givenalocation inthesceneweclassifywhichofthe30poseclassesisthe likelyaffordancepose. Giventheposeclassandthescene, wethenuseaVariationalAutoencoder(VAE)[23]toextract thescaleanddeformationofthepose. TheVAEallowsusto … … samplethedistributionofpossibleposesattesttime. Finally, weshowtheimportanceoflarge-scaledatainlearninga Transferred Poses Transferred Poses generalizableandrobustmodelofaffordances. 1.Introduction Oneofthelong-termgoalsofcomputervision,asitin- tegrateswithrobotics,istotranslateperceptionintoaction. Figure1.Weproposetobinge-watchsitcomstoextractoneofthe While vision tasks such as semantic or 3D understanding largestaffordancedatasetsever. Weusemorethan100Mframes have seen remarkable improvements in performance, the fromsevendifferentsitcomstofindemptyscenesandsamescene taskoftranslatingperceptionintoactionshasnotseenany withhumans. Thisallowsustocreatealarge-scaledatasetwith majorgains. Forexample,thestateoftheartapproachesin scenesandtheiraffordances. predictingaffordancesstilldonotuseanyConvNetswith theexceptionof [12]. Whyisthat? Whatiscommonacross the tasks affected by ConvNets is the availability of large structured light cameras such as the Kinect. But no such scalesupervisions. Forexample,insemantictasks,thesu- datasets exist for supervising actions afforded by a scene. pervision comes from crowd-sourcing tools like Amazon Canwecreatealarge-scaledatasetthatcanalterthecourse MechanicalTurk;andin3Dtasks,supervisioncomesfrom inthisfieldaswell? There are several possible ways to create a large-scale ⇤Indicatesequalcontribution. 1Affordancesareopportunitiesofinteractioninthescene. Inother datasetforaffordances: (a)Firstoptionistolabelthedata: words,itrepresentswhatactionscantheobjectbeusedfor. givenemptyimagesofroom,wecanaskmechanicalturkers 1 tolabelwhatactionscanbedoneatdifferentlocations.How- years,theideaoffunctionalrecognitiontookbackstagebe- ever,labelingimageswithaffordancesisextremelydifficult causetheseapproacheslackedtheabilitytolearnfromdata andanunscalablesolution. (b)Thesecondoptionistoau- andhandlethenoisyinputimages. tomaticallygeneratedatabydoingactionsthemselves. One On the other hand, we have made substantial progress caneitheruserobotsandreinforcementlearningtoexplore in the field of semantic image understanding. This is pri- the worldand theaffordances. However, collectinglarge- marilyduetotheresultofavailabilityoflargescaletrain- scalediversedataisnotyetscalableinthismanner. (c)A ing datasets [8, 31] and high capacity models like Con- thirdoptionistousesimulation: onesuchexampleis [12] vNets[29,28]. However,thesuccessofConvNetshasnot wheretheyusetheblockgeometricmodeloftheworldto resultedinsignificantgainsforthefieldoffunctionalrecog- knowwherehumanskeletonswouldfit.However,thismodel nition. Ourhypothesisisthatthisisduetothelackoflarge onlycapturesphysicallylikelyactionsanddoesnotcapture scaletrainingdatasetsforaffordances. Whileitiseasytola- thestatisticalprobabilitiesbehindeveryaction. Forexample, belobjectsandscenes,labelingaffordancesisstillmanually itallowspredictionssuchashumanscansitontopofstoves; intensive. andfortheopenspaceneardoorsitpredictswalkingasthe Therearetwoalternativestoovercomethisproblem.First topprediction(eventhoughitshouldbereachingthedoor). istoestimateaffordancesbyusingreasoningontopofse- Inthispaper,weproposeanotheralternative: watchthe mantic[6,21,4]and3D[10,2,41,15]sceneunderstand- humansdoingtheactionsandusethosetolearnaffordances ing. Therehasbeenalotofrecentworkwhichfollowthis ofobjects. Buthowdowefindlarge-scaledatatodothat? alternative: [18,25]modelrelationshipbetweensemantic We propose to binge-watch sitcoms to extract one of the object classes and actions; Yao et al. [42] model relation- largestaffordancedatasetsever. Specifically,weuseevery shipsbetweenobjectandposes. Theserelationshipscanbe episodeandeveryframeofsevensitcoms2whichamounts learnedfromvideos[18], staticimages[42]oreventime- toprocessingmorethan100Millionframestoextractparts lapsevideos[7]. Recently,[45]proposedawaytoreason ofsceneswithandwithouthumans. Wethenperformauto- aboutobjectaffordancesbycombiningobjectcategoriesand maticregistrationtechniquesfollowedbymanualcleaning attributes in a knowledge base manner. Apart from using totransferposesfromsceneswithhumanstosceneswithout semantics, 3D properties have also been used to estimate humans. This leads to a dataset of 28882 poses in empty affordances [19, 17, 11, 43, 5]. Finally, there have been scenes. efforts to use specialized sensors such as Kinect to esti- Wethenusethisdatatolearnamappingfromscenesto mategeometryfollowedbyestimatingaffordancesaswell affordances. Specifically,weproposeatwo-stepapproach. [22,26,27,46]. In the first step, given a location in the scene we classify Whilethefirstalternativetriestoestimateaffordancesin which of the 30 pose classes (learned from training data) low-dataregime,asecondalternativeistocollectdatafor isthelikelyaffordancepose. Giventheposeclassandthe affordanceswithoutaskinghumanstolabeleachandevery scene, we then use the Variational Autoencoder (VAE) to pixel. Onepossiblewayistohaverobotsthemselvesexplore extract the scale and deformation of the pose. Instead of theworldandcollectdataforhowdifferentobjectscanused. givingasingleansweroraveragingthedeformations,VAE For example, [34] uses self-supervised learning to learn allows us to sample the distribution of possible poses at graspingaffordancesofobjectsor[1,33]focusonlearning test time. We show that training an affordance model on pushingaffordances. However,usingrobotsforaffordance large-scaledatasetleadstoamoregeneralizableandrobust supervisionisstillnotascalablesolutionsincecollecting model. thisdatarequiresalotofeffort. Anotherpossibilityistouse simulations[32]. Forexample,Fouheyetal.[12]proposea 3D-Humanposesimulatorwhichcancollectlargescaledata 2.RelatedWork using3Dposefitting. Butthisdataonlycapturesphysical Theideaofaffordances[14]wasproposedbyJamesJ. notion of affordances and does not capture the statistical Gibsoninlateseventies,wherehedescribedaffordancesas probabilitiesbehindeveryaction.Inthiswork,weproposeto “opportunitiesforinteractions”providedbytheenvironment. collectoneofthebiggestaffordancedatasetsusingsitcoms InspiredbyGibson’sideas,ourfieldhastimeandagainfid- and minimal human inputs. Our approach sifts through dledwiththeideaoffunctionalrecognition[38,36]. Inmost more than 100M frames to find high-quality scenes and cases, the common approach is to first estimate physical correspondinghumanposestolearnaffordanceproperties attributesandthenreasonaboutaffordances. Specifically, ofobjects. manually-definedrulesareusedtoreasonaboutshapeand geometry to predict affordances [38, 40]. However, over 3.SitcomAffordanceDataset 2HowIMetYourMother,Friends,TwoandaHalfMen,Frasier,Sein- Ourfirstgoaltowardsdata-drivenaffordancesistocollect field,TheBigBangTheory,EverybodyLovesRaymond a large scale dataset for affordances. What we need is an 2 Figure2.SomeexampleimagesfromSitcomAffordancedataset. Notethatourimagesarequitediverseandwehavelargesamplesof possibleactionsperimage. imagedatasetofscenessuchaslivingrooms,bedroomsetc we first filter out scenes based on the size of the largest andwhatactionscanbeperformedindifferentpartsofthe facedetectedinthescenes. WealsoappliedFast-RCNNto scene. Inthispaper,inspiredbysomerecentwork[20],we detecthumans[16]inthescene. Werejectthesceneswhere representtheoutputspaceofaffordancesintermsofhuman humansaredetected. Finally,wehavealsotrainedaCNN poses. Butwherecanwefindimagesofthesamescenewith classifierforemptyscenes.Thepositivetrainingdataforthis orwithoutpeopleinit? classifierarescenesinSUN-RGBD[37]andMIT67[35]; Theanswertotheabovequestionliesinexploitingthe the negative data are random video frames from the TV TVSitcoms. Insitcoms,charactersshareacommonenvi- seriesandImages-of-Groups[13]. Theclassifierisfinetuned ronment,suchasahomeorworkplace. Ascenewithexact onPlaceNet[44]. Aftertrainingthisclassifier,weapplyit configurationofobjectsappearsagainandagainasmulti- backontheTVseriestrainingdataandselect1000samples pleepisodesareshotinit. Forexample,thelivingroomin withthehighestpredictionscores. Wemanuallylabelthese Friendsappearsinallthe10seasonsand 240episodesand 1000imagesandusethemtofine-tunedtheclassifieragain. eachactorlabelstheactionspaceinthesceneonebyoneas This“hardnegative”miningprocedureturnsouttobevery theyperformdifferentactivities. effectiveandimprovethegeneralizationpoweroftheCNN Weusesevensuchsitcomsandprocessmorethan100M acrossallTVseries. framesofvideotocreatethelargestaffordancedataset. We followathree-stepapproachtocreatethedataset: (1)Asa PeopleWatching: FindingSceneswithPeople firststep,weminethe100Mframestofindemptyscenesor sub-scenes. Weuseanemptysceneclassifierinconjunction Weusetwosearchstrategiestofindsceneswithpeople. Our with face and person detector to find these scenes; (2) In firststrategyistouseimageretrievalwhereweuseempty thesecondstep,weusetheemptyscenestofindthesame scenesasqueryimagesandalltheframesintheTV-series scenes but with people performing actions. We use two as retrieval dataset. We use cosine distance on the pool5 strategiestosearchforframeswithpeopleperformingac- featuresextractedbyImageNetpre-trainedAlexNet. Inour tionsandtransfertheestimatedposes[3]toemptyscenesby experiments,wefindthepool5featuresarerobusttosmall simplealignmentprocedure;(3)Inthefinalstep,weperform changesoftheimage,suchasthedecorationsandnumber manualfilteringandcleaningtocreatethedataset. Wenow of people in the room, while still be able to capture the describeeachofthesestepsindetail. spatialinformation.Thisallowsustodirectlytransferhuman skeletons from matching images to the query image. We showsomeexamplesofdatageneratedusingthisapproach ExtractingEmptyScenes inthetoptworowsofFig.3. Weuseacombinationofthreedifferentmodelstoextract Besides global matching of frames across different empty scenes from 100M frames: face detection, human episodesofTVshows,wealsotransferhumanposeswithin detectionandsceneclassificationscores. Inourexperiment, localshots(shortclipsatmost10seconds)ofvideos. Specif- wefindfacedetection[30]isthemostreliablecriteria. Thus, ically,givenoneemptyframewelookintothevideoframes 3 Matching Image Transferred Pose Matching Image Transferred Pose Matching Image Transferred Pose (a) Global Matching (b) Local Matching Figure3.Weproposetousetwoapproachestotransferposes.Inglobalmatchingapproach,wematchanemptyscenetoalltheimagesin thesitcom.Sometimesthematchesoccurindifferentseasons.Giventhesematches,wetransferposestotheimage.Inthelocalmatching approach,weusethenext5-10secofvideototransferposesviaoptical-flowregistrationscheme. rangingfrom5secondsbeforethisframeto5secondsafter theposestructureandregularizetheoutput. it. We perform pose estimation on every frame. We then Weexploreanalternativeway:wedecomposetheprocess computethecameramotionsofeachframewithrespectto ofpredictingposesintotwostages:(i)categoricalprediction: theemptyframebyaccumulatingtheopticalflows. Given we first cluster all the human poses in the dataset into 30 thesemotioninformation,wecanmapthedetectedposesto clusters,andpredictwhichposeclusterismostlikelygiven theemptyframe,asshowninthebottomtworowsinFig.3. thelocationinthescene;(ii)giventheposeclustercenter, we predict its scale as well as the deformations for pose jointssuchthatitfitsintotherealscene. ManualAnnotations Our goal is to use the 4.1.PoseClassification automated procedure above Source #Datapoints Asafirststep,givenaninputimageandalocation,we togeneratevalidhypothesis firstdoacategoricalpredictionofhumanposes. Butwhat of possible poses in empty HIMYM 3506 aretherightcategories? Weuseadata-drivenvocabularyin scenes. However, the align- TBBT 3997 ourcase. Specifically,weuserandomlysampled10Kposes ment procedure is not per- Friends 3872 from the training videos. We then compute the distances fect by any means. Thus, TAAHM 3212 betweeneachpairofposesusingprocrustesanalysisover wealsoneedhumanannota- ELR 5210 the2Djointcoordinates,andclustertheminto30clusters torstomanuallyadjustpose Frasier 6018 usingk-mediodclustering. Wevisualizethecentersofthe jointsbyscalingandtranslat- Seinfeld 3067 clustersasFig.5. ing. In this way, the pose Total 28882 Inthefirststageofprediction,wetrainaConvNetwhich in the empty scene can be uses the location and the scene as input and predict the alignedwiththehumaninthe likely pose class. Note that multiple pose classes could imagewheretheposeistransferredfrom. Incaseswhere be reasonable in a particular location (e.g. one can stand theposesarenotfittingwiththescene,duetoocclusionsor beforethechairorsitonthechair),thuswearenottryingto incorrectmatching,weremovesuchposes. Thefinaldataset regresstheexactposeclass. Insteadwepredictaprobability weobtaincontains28882humanposesinside11449indoor distributionoverallclassesandselectthemostlikelyone. scenes. Thedetailedstatisticsofhowmanyposesforeach Theselectedposecentercanbefurtheradjustedtofitinthe TVseriesaresummarizedinTable3. sceneinthesecondstage. TechnicalDetails: TheinputtotheConvNetistheimage 4.VAEsforEstimatingAffordances andthelocationwheretopredictlikelypose. Torepresent Givenanindoorsceneandthelocation,wewanttopredict thispointintheimage,wecropasquarepatchusingitas what is the most likely human pose. One naive approach the center and the side length is the height of the image is training a ConvNet with the image and the location as frame. Wealsocropanotherpatchinasimilarwayexcept input, predict the heat maps for each joint of the pose as thatthelengthishalftheheightoftheimage. Asillustrates state-of-the-art pose estimation approaches [3]. However, inFig.4(a),thereddotsontheinputimagesrepresentthe ourproblemisverydifferentfromstandardposeestimation, location. Thetwocroppedpatchescanofferdifferentscales sincewedonothavetheactualhumanwhichcanprovide ofinformationandwealsotakethewholeimageasinput. 4 Pose Scale and FC (36) Deformation Sample z from N(µ, Σ) FC µ, Σ (30 x 2) FC 30 Pose Classes FC FC FC FC (30) FC FC FC FC FC FC Latent FC Alex Alex Alex Alex Alex Alex Variable z Alex Alex Alex Net Net Net Net Net Net Net Net Net Scale and Deformation (36-d) Pose Class Pose Class (30-d) (30-d) (a) Classification Network (b) VAE Encoder (c) VAE Decoder Figure4. OurAffordancePredictionModel.TheencoderanddecoderinVAEsharetheweightswhicharehighlightedasgreen.Allfully connectedlayershave512neuronsunlessitisspecifiedinthefigure. deformationsandscaleasy,theconditionedinputimages andposeclassasx,andthelatentvariablessampledfroma distributionQasz. Thenthestandardvariationalequality canberepresentedas, logP(y x) KL[Q(z x,y) P(z x,y)] | � | || | Figure5. Clustercentersofhumanposesinsitcomdataset.These =E [logP(y z,x)] KL[Q(z x,y) P(z x)], (1) clusters are used as pose categories predicted by classification z⇠Q | � | || | network. whereKLrepresentstheKL-divergencebetweenthedistri- butionQ(z x,y)andthedistributionP(z x). Notethatin | | The3inputimagesareallre-scaledto227 227. VAE, we assume P(z x) is a normal distribution (0,1). ⇥ | N Giventhe3inputimages,theyareforwardedto3Con- ThedistributionQisanothernormaldistributionwhichcan vNets which share the weights between them. We apply berepresentedasQ(z x,y)=Q(z µ(x,y),�(x,y)),where | | theAlexNetarchitecture[28]fortheConvNethereandcon- µ(x,y)and�(x,y)areestimatedviatheencoderinVAE. catenate the 3 fc7 outputs. The concatenated feature is Thelog-likelihoodlogP(y z,x)ismodeledbythedecoder | furtherfullyconnectedto30outputs,whichrepresents30 inVAE.Thedetailsareexplainedasbelow. pose classes. During training, we apply SoftMax classifi- Encoder. As Fig. 4(b) illustrates, the inputs of model cation loss and the AlexNet is pre-trained with ImageNet include3imageswhichareobtainedinthesamewayasthe dataset[9]. classificationnetwork,a30-dbinaryvectorindicatingthe poseclass(onlyonedimensionisactivatedas1),anda36-d 4.2.ScaleandDeformationEstimation vectorrepresentingthegroundtruthscaleanddeformations. Giventheposeclassandscene, weneedtopredictthe TheimagesarefedintotheAlexNetmodelandweextract scaleandthedeformationsofeachjointtofittheposeinto the fc7 feature for each of them. The pose binary vector thescene. However,thescaleanddeformationsofthepose andthevectorofscaleanddeformationsarebothforwarded are not deterministic and there could be ambiguities. For though two fully connected layers. Each fully connected example, given an empty floor and a standing pose class, layer has 512 neurons. The outputs for each components itcouldbeachildstandingthere(whichcorrespondstoa are then concatenated together and fully connected to the smallerscale)oranadultstandingthere(whichcorresponds outputs. Thedimensionoftheoutputsis30 2whichare ⇥ toalargerscale).ThusinsteadofdirectlytrainingaConvNet two vectors of mean µ(x,y) and variance �(x,y) of the toregressthescaleanddeformations,weapplythecondi- distributionQ. tionalVariationalAuto-Encoder(VAE)[23,39]togenerate Wecalculatethegroundtruthscalefortheheightofpose thescaleanddeformationsconditionedontheinputscene s astheactualposeheightdividedbythenormalizedheight h andposeclass. (ranging from 0 to 1) of cluster center. The ground truth FormulationsfortheconditionalVAE.Weappliedthe scaleforthewidthsw iscalculatedinasimilarway. Given conditional VAE to model the scale and deformations of the ground truth (s ,s ), we can re-scale the cluster cen- h w the estimated pose class. For each sample, we define the ter and aligned it to the input location. The deformation 5 for each joint is the spatial distance (dx,dy) between the of the estimation, we repeat this procedure by sampling scaledcenterandgroundtruthpose. Sincewehave17pose different z for m = 10 times, and calculate the average joints(dX,dY) = (dx ,dy ,...,dx ,dy ), thereare34 distance 1 mD asthefinalresult. Ifthefinaldistance 1 1 17 17 m 1 i outputs representing the deformations. Together with the islessthanathreshold�,thenthegivenposeistakenasa P scales ,s ,theoutputsofthegeneratorare36realnum- reasonablepose. h w bers. Decoder. AsFig.4(c)shows,thedecoderhasasimilar 5.ExperimentsandResults architecture as the encoder. Instead of taking a vector of Wearegoingtoevaluateourapproachontwotasks: (i) scaleanddeformationsasinput,wereplaceitwiththelatent variablesz. Theoutputofthenetworkischangedtoa36- affordanceprediction: givenaninputimageandalocation, generatethelikelyhumanposeatthatlocation;(ii)classify d vector of scale and deformations whose groundtruth is whetheragivenposeinasceneispossibleornot. identicaltothe36-dinputvectoroftheencoder.Notethatwe sharethefeaturerepresentationsfortheconditionalvariables We train our models using data collected from the TV (imagesandclasses)betweentheencoderanddecoder. seriesof “HowI MetYour Mother”, “The BigBangThe- Training. AsindicatedbyEq.1,wehavetwolossesdur- ory”,“TwoandAHalfMan”,“EveryoneLovesRaymond”, ingtrainingtheVAEmodel. Tomaximizethelog-likelihood, “Frasier”and“Seinfeld”.Themodelsaretestedontheframes we apply a Euclidean distance loss to minimize distance collectedfrom“Friends”. Fortrainingdata,wehaveman- between the estimated scale and deformations y and the uallyfilteredandlabeled25010accurateposesover10009 ⇤ groundtruthsas, differentscenes. Fortestingdata,wehavecollected3872 accurateposesover1490differentscenesandwehavealso L1 = y⇤ y 2. (2) artificiallygenerated9572posesinthesamesceneswhich || � || areeitherphysicallyimpossibleorveryunlikelytohappen AndtheotherlossistominimizetheKL-divergencebetween intherealworld. the estimated distribution Q and the normal distribution Duringtraining,weinitializetheAlexNetimagefeature (0,1)as, extractor with ImageNet pre-training and the other layers N are initialized randomly. We apply the Adam optimizer L2 =KL[Q(z µ(x,y),�(x,y)) (0,1)]. (3) during training with learning rate 0.0002 and momentum | ||N term� =0.5,� =0.999. Toshowthatlargescaleofdata 1 2 NotethatthefirstlossL isappliedontopofthedecoder 1 matters,weperformtheexperimentsondifferentsizeofthe model, and its gradient is backpropagated though all the dataset. layers. To do this, we follow the reparameterization trick Wealsoevaluatetheperformanceofourapproachasthe introducedin[24]: werepresentthelatentvariablesasz = training dataset size increases. Specifically, we randomly µ(x,y)+�(x,y) ↵,where↵isavariablesampledfrom · sample2.5Kand10Kofdatafortraining,andcomparethese (0,1). Inthisway,thelatentvariableszisdifferentiable N modelswithourfullmodelwhichuses25Kdatafortraining. withrespecttoµand�. 4.3.Inference Baseline WecompareourVAEapproachtoaheatmapre- gressionbasedbaseline. Essentially,werepresentthehuman Giventhetrainedmodels, wewanttotackletwotasks: skeletonsasa17-channelheatmap,oneforeachjoint. We (i)generatingposesonanemptylocationofasceneand(ii) trainathree-towerAlexNet(uptoconv5)architecture,where estimateifaposefitsthesceneornot. each tower looks at a different scale of the image around Forthefirsttask,givenanimageandapointrepresenting thegivenpoint. Thetowershavesharedparametersandare thelocationintheimage,wefirstperformposeclassification initialized with ImageNet. The outputs are concatenated and obtain the normalized center pose of the correspond- acrossthetowersandpassedthroughaconvolutionandde- ingclass. Theclassificationscores,togetherwiththelatent convolutionlayertoproducea17channelheatmap,which variableszsampledfrom (0,1)andimagesareforwarded N istrainedwitheuclideanloss. to the VAE decoder model (Fig. 4 (c)). We scale the nor- malizedcenterwiththeinferredscale(s ,s )andalignthe ⇤h ⇤w 5.1.Generatingposesinthescenes posewiththeinputpoint. Thenweadjusteachjointofthe posebyaddingthedeformations(dX ,dY ). Aswementionedintheapproach,wegeneratethehuman ⇤ ⇤ Forthesecondtask,wewanttoestimatewhetheragiven poseviaestimatingtheposeclassandthescaleaswellas posefitsthesceneornot. Todothis,wefirstperformthe deformations. sameestimationoftheposegivenanemptysceneasthefirst Qualitative results. We show our prediction results as task,thenwecomputetheeuclideandistanceDbetweenthe Fig.6(a). Wehaveshownthatourmodelcangeneratevery estimatedposeandthegivenpose. Toensuretherobustness reasonableposesincludingsittingonacoach, closingthe 6 (a) Predicted Poses Ours GT Ours GT (b) Comparing to Ground Truth Figure6. Qualitativeresultsofgeneratingposes:WeshowqualitativeresultsonscenesfromFriendsdataset.(a)Aswecanseethehuman posesgeneratedseemveryreasonable.(b)Wealsocomparetheposesgeneratedusingourapproachwiththegroundtruthposes. door and reaching to the table, etc. We also compare our Wetestourmodelonthe3872samplesfrom“Friends”,the resultswiththegroundtruthasFig.6(b). Weshowthatwe results is shown in Table 1. We compare models trained cangeneratereasonableresultseventhoughitcanbevery onthreedifferentsizesofourdataset(2.5K,10Kand25K). differentfromthegroundtruthposes. Forexample,inthe Weshowthatthemoredatawehave,thehigheraccuracies 3rdcolumnofthe2ndrow, wepredictaposesittingona we can get. For the heatmap baseline, we use the inner bedwhilethegroundtruthisstandinginfrontofthebed. product of the predicted heatmap to assign the output to the cluster centers. We obtain the top-5 assignments and Wehavealsovisualizedtheresultsgivendifferentnoise standardevaluationasaboveforclassificationperformance. zasinputstotheVAEDecoderduringtesting. Forthesame Asthenumbersshow,ourapproachclearlyoutperformsthis sceneandlocation,wecangeneratedifferentposesasFig.7. heatmapbasedbaseline. Quantitativeresults. Toshowthatthegeneratedposesare inreasonable,wefirstevaluatetheperformanceofourpose Humanevaluation. Wealsoperformhumanevaluationon classification network. Note that there could be multiple our approach. Given the ground truth pose and predicted reasonableposesinagivenlocation,thusweshowour30- poseinthesamelocationofthesameimage,weaskhuman classclassificationaccuraciesgiventop1totop5guesses. whichoneismorerealistic. Wefindthat46%ofthetime 7 Figure7. AffordanceresultsforVAEwithdifferentnoiseinputs.Eachcolumnincludestwopredictionsresultsforthesameinputs.Given thehumanposeclass,weshowhowVAEcanbeusedtosamplemultiplescaleanddeformations. Figure8. Negativesamplesaddedinthetestdataset.71%ofthetestdataissuchimagesand29%ispositiveexamples. Method Top-1 Top-2 Top-3 Top-4 Top-5 someofthenegativesamplesasFig.8. Notethatalthough HeatMap(Baseline) 8.4% 19.9% 30.1% 39.1% 47.3% weusenegativedataintesting,butthereisnonegativedata Trainingwith2.5K 11.7% 21.9% 29.7% 36.1% 41.8% involvedintraining. WeshowthePrecision-Recallcurveas Trainingwith10K 13.3% 23.7% 32.3% 39.7% 46.8% Fig.9. Amongallofourapproaches,wefindthattraining Trainingwith25K 14.9% 26.0% 36.0% 43.6% 50.9% Table1.Classificationresultsonthetestset. with25Kdatapointsgivebestresults,whichisconsistent withthefirsttask. Fortheheatmapbaseline,weagainscore eachsampleastheinnerproductofpredictedheatmapwith aheatmaprepresentationofthelabeledpose. Weobserve thatthebaselinedoesbetterthanourapproachinhigh-recall regimes, which can be explained by the fact that training witheuclideanlossgeneratesanaveraged-outoutput,which islesslikelytomissapose. 6.Conclusion Inthis paper, we presentoneof thebiggestaffordance dataset. Weuse100Millionframesfromsevensitcomsto extract diverse set of scenes and how actors interact with differentobjectsinthosescenes. Ourdatasetconsistofmore than 10K scenes and 28K ways humans can interact with these 10K images. We also propose a two step approach Figure9. PRcurveforoursecondexperiment: givenanimage andpose,weproduceascoreonhowprobableitis.Weusethese to predict affordance pose given an input image and the scorestocomputetherecall-precisiongraph. location. Inthefirststep,weclassifywhichofthe30pose classesisthelikelyaffordancepose. Giventheposeclass theturkersthinkourpredictionismorerealistic. Notethat andthescene,wethenuseVariationalAutoencoder(VAE) arandomguessis50%,whichmeansourpredictionresults toextractthescaleanddeformationofthepose. VAEallows arealmostasrealasthegroundtruthandtheturkerscannot ustosamplethedistributionofpossibleposesattesttime. tellwhichisgeneratedbyourmodel. Ourresultsindicatethattheposesgeneratedbyusingour approacharequiterealistic. 5.2.Classifyinggivenposesinthescenes Acknowledgement: Thisworkwassupportedbyaresearchgrantfrom Givenaposeinalocationofthescene,wewanttoesti- AppleInc.Anyviews,opinions,findings,andconclusionsorrecommenda- matehowlikelytheposeisusingourmodel. Weperform tionsexpressedinthismaterialarethoseoftheauthor(s)andshouldnotbe ourexperimentson3872positivesamplesfrom“Friends” interpretedasreflectingtheviews,policiesorposition,eitherexpressedor and 9572 negative samples in the same scenes. We show implied,ofAppleInc. 8 References [22] H.JiangandJ.Xiao. Alinearapproachtomatchingcuboids inRGBDimages. InCVPR,2013. 2 [1] P.Agrawal,A.Nair,P.Abbeel,J.Malik,andS.Levine.Learn- [23] D.KingmaandM.Welling. Auto-encodingvariationalbayes. ing to poke by poking: Experiential learning of intuitive InICLR,2014. 1,5 physics. InarXivpreprintarXiv:1606.07419,2016. 2 [24] D.P.KingmaandM.Welling. Auto-encodingvariational [2] A.Bansal,B.Russell,andA.Gupta. Marrrevisited:2d-3d bayes. InICLR,2014. 6 alignmentviasurfacenormalprediction. InCVPR,2016. 2 [25] H.Kjellstrom,J.Romero,D.Martinez,andD.Kragic. Si- [3] J.Carreira,P.Agrawal,K.Fragkiadaki,andJ.Malik. Hu- multaneousvisualrecognitionofmanipulationactionsand manposeestimationwithiterativeerrorfeedback. InarXiv manipulatedobjects. InECCV,2008. 2 preprintarXiv:1507.06550,2015. 3,4 [26] H. Koppula and A. Saxena. Physically-grounded spatio- [4] L.Chen,G.Papandreou,I.Kokkinos,K.Murphy,andA.L. temporalobjectaffordances. InECCV,2014. 2 Yuille. Deeplab: Semanticimagesegmentationwithdeep [27] H.KoppulaandA.Saxena. Anticipatinghumanactivities convolutionalnets,atrousconvolution,andfullyconnected usingobjectaffordancesforreactiveroboticresponse.TPAMI, crfs. CoRR,abs/1606.00915,2016. 2 2015. 2 [5] S.T.D.XieandS.Zhu.Inferringdarkmatteranddarkenergy [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet fromvideos. InICCV,2013. 2 classificationwithdeepconvolutionalneuralnetworks. In [6] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: object detec- NIPS,2012. 2,5 tionviaregion-basedfullyconvolutionalnetworks. CoRR, [29] Y.LeCun,B.Boser,J.Denker,D.Henderson,R.E.Howard, abs/1605.06409,2016. 2 W.Hubbard,andL.D.Jackel. Handwrittendigitrecognition [7] V. Delaitre, D. Fouhey, I. Laptev, J. Sivic, A. Efros, and withaback-propagationnetwork. InNIPS,1990. 2 A.Gupta. Scenesemanticsfromlong-termobservationof [30] Z.Liang,S.Ding,andL.Lin. Unconstrainedfaciallandmark people. InECCV,2012. 2 localizationwithbackbone-branchesfully-convolutionalnet- [8] J.Deng,W.Dong,R.Socher,L.-J.Li,K.Li,andL.Fei-Fei. works. InarXiv:1507.03409,2015. 3 ImageNet:ALarge-ScaleHierarchicalImageDatabase. In [31] T.Lin,M.Maire,S.J.Belongie,L.D.Bourdev,R.B.Gir- CVPR,2009. 2 shick,J.Hays,P.Perona,D.Ramanan,P.Dolla´r,andC.L. [9] J.Deng,W.Dong,R.Socher,L.-J.Li,K.Li,andL.Fei-Fei. Zitnick.MicrosoftCOCO:commonobjectsincontext.CoRR, ImageNet:ALarge-ScaleHierarchicalImageDatabase. In 2014. 2 CVPR,2009. 5 [32] R.Mottaghi,M.Rastegari,A.Gupta,andA.Farhadi. “what [10] D.EigenandR.Fergus. Predictingdepth,surfacenormals happensif...”learningtopredicttheeffectofforcesinimages. andsemanticlabelswithacommonmulti-scaleconvolutional InECCV,2016. 2 architecture. InICCV,2015. 2 [33] L.Pinto,D.Gandhi,Y.Han,Y.-L.Park,andA.Gupta. The [11] D.F.Fouhey,V.Delaitre,A.Gupta,A.A.Efros,I.Laptev, curiousrobot: Learningvisualrepresentationsviaphysical andJ.Sivic. Peoplewatching: Humanactionsasacuefor interactions. InECCV,2016. 2 single-viewgeometry. InECCV,2012. 2 [34] L.PintoandA.Gupta.Supersizingself-supervision:Learning [12] D.F.Fouhey,X.Wang,andA.Gupta.Indefenseofthedirect tograspfrom50ktriesand700robothours. InICRA,2016. perceptionofaffordances. CoRR,abs/1505.01085,2015. 1,2 2 [13] A.GallagherandT.Chen. Understandinggroupsofimages [35] A.QuattoniandA.Torralba. Recognizingindoorscenes. In ofpeople. InCVPR,2009. 3 CVPR,2009. 3 [14] J. Gibson. The ecological approach to visual perception. [36] E.Rivlin,S.Dickinson,andA.Rosenfeld. Recognitionby Boston:HoughtonMifflin,1979. 2 functionalparts. InCVIU,1995. 2 [15] R.Girdhar,D.Fouhey,M.Rodriguez,andA.Gupta.Learning [37] S.Song, S.Lichtenberg, andJ.Xiao. Sunrgb-d: Argb-d apredictableandgenerativevectorrepresentationforobjects. sceneunderstandingbenchmarksuite. InCVPR,2015. 3 InECCV,2016. 2 [38] L.StarkandK.Bowyer. Achievinggeneralizedobjectrecog- [16] R.Girshick. Fastr-cnn. InICCV,2015. 3 nition through reasoning about association of function to [17] H.Grabner,J.Gall,andL.vanGool. Whatmakesachaira structure. InPAMI,1991. 2 chair? InCVPR,2011. 2 [39] J.Walker,C.Doersch,A.Gupta,andM.Hebert. Anuncer- [18] A.GuptaandL.S.Davis. Objectsinaction: Anapproach tainfuture: Forecastingfromvariationalautoencoders. In forcombiningactionunderstandingandobjectperception. In EuropeanConferenceonComputerVision,2016. 5 CVPR,2007. 2 [40] P.Winston, T.Binford, B.Katz, andM.Lowry. Learning [19] A.Gupta,S.Satkin,A.Efros,andM.Hebert. From3Dscene physicaldescriptionfromfunctionaldefinitions, examples geometrytohumanworkspace. InCVPR,2011. 2 andprecedents. InMITPress,1984. 2 [20] A.Gupta,S.Satkin,A.A.Efros,andM.Hebert. From3d [41] J.Wu,T.Xue,J.J.Lim,Y.Tian,J.B.Tenenbaum,A.Torralba, scenegeometrytohumanworkspace. InComputerVision andW.T.Freeman. Singleimage3dinterpreternetwork. In andPatternRecognition(CVPR),2011. 3 EuropeanConferenceonComputerVision(ECCV),2016. 2 [21] K.He,X.Zhang,S.Ren,andJ.Sun. Deepresiduallearning [42] B.YaoandL.Fei-Fei.Modelingmutualcontextofobjectand for image recognition. arXiv preprint arXiv:1512.03385, humanposeinhuman-objectinteractionactivities. InCVPR, 2015. 2 2010. 2 9 [43] Y.ZhaoandS.Zhu. Sceneparsingbyintegratingfunction, geometryandappearancemodels. InCVPR,2013. 2 [44] B.Zhou,A.Lapedriza,J.Xiao,A.Torralba,andA.Oliva. Learning deep features for scene recognition using places database. InNIPS,2014. 3 [45] Y. Zhu, A. Fathi, and L. Fei-Fei. Reasoning about object affordancesinaknowledgebaserepresentation. InECCV, 2014. 2 [46] Y. Zhu, Y. Zhao, and S. Zhu. Understanding tools: Task- orientedobjectmodeling,learningandrecognition. InCVPR, 2015. 2 10
Description: