WU et al.: PHYSICS 101 - LEARNING PHYSICAL PROPERTIES

Physics 101: Learning Physical Object Properties from Unlabeled Videos

Jiajun Wu (1), http://jiajunwu.com
Joseph J. Lim (2), http://people.csail.mit.edu/lim/
Hongyi Zhang (1), http://web.mit.edu/~hongyiz/www/
Joshua B. Tenenbaum (1), http://web.mit.edu/cocosci/josh.html
William T. Freeman (1, 3), http://billf.mit.edu

(1) CSAIL, Massachusetts Institute of Technology, MA, USA
(2) Department of Computer Science, Stanford University, CA, USA
(3) Google Research, MA, USA

Abstract

We study the problem of learning physical properties of objects from unlabeled videos. Humans can learn basic physical laws when they are very young, which suggests that such tasks may be important goals for computational vision systems. We consider various scenarios: objects sliding down an inclined surface and colliding; objects attached to a spring; objects falling onto various surfaces, etc. Many physical properties like mass, density, and coefficient of restitution influence the outcome of these scenarios, and our goal is to recover them automatically. We have collected 17,408 video clips containing 101 objects of various materials and appearances (shapes, colors, and sizes). Together, they form a dataset, named Physics 101, for studying object-centered physical properties. We propose an unsupervised representation learning model, which explicitly encodes basic physical laws into its structure and uses them, with automatically discovered observations from videos, as supervision. Experiments demonstrate that our model can learn physical properties of objects from video. We also illustrate how its generative nature enables solving other tasks such as outcome prediction.

1 Introduction

How can a vision system acquire commonsense knowledge from visual input? This fundamental question has been of interest to researchers in both the human vision and computer vision communities for decades. One basic component of commonsense knowledge is to understand the physics of the world. There is evidence that babies form a visual understanding of basic physical concepts at a very young age; they learn properties of objects from their motions [2].
As young as 2.5 to 5.5 months old, infants learn basic physics even before they acquire advanced high-level knowledge like semantic categories of objects [2, 7]. Both infants and adults also use their physics knowledge to learn and discover latent labels of object properties, as well as predict the physical behavior of objects [3]. These facts suggest the importance for a visual system of understanding physics, and motivate our goal of building a machine with such competency.

© 2016. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

Figure 1: Left: Overview of our model, which learns directly from unlabeled videos, produces estimates of physical properties of objects based on the encoded physical laws, and generalizes to tasks like outcome prediction. Right: An abstraction of the physical world. See discussion in Section 2.

There have been several early efforts to build computational vision systems with the physical knowledge of a young child [6], or specific physical properties, such as material, from the perspectives of human and machine perception [1, 17]. Recently, researchers started to tackle concrete scenarios for understanding physics from vision [11, 15, 18, 19, 23, 24, 25, 27], some building on recent advances in deep learning. Different from those approaches, in this work we aim to develop a system that can infer physical properties, e.g. mass and density, directly from visual input. Our method is general and easily adapted to new scenarios, and is more efficient compared to analysis-by-synthesis approaches [23].

Our goal is to discover, learn, and infer physical object properties by combining physics knowledge with machine learning methods.
Our system combines a low-level visual recognition system with a physics interpreter and a physical world simulator to estimate physical properties directly from unlabeled videos. We integrate physical laws, which physicists have discovered over many centuries, into representation learning methods, which have shown usefulness in computer vision in recent years. Our incorporation of physical laws into our model distinguishes our framework from previous methods introduced in the computer vision and robotics community for predicting physical interactions or properties of objects for various purposes [4, 5, 9, 12, 16, 21, 22, 26].

We present a formulation to learn physical object properties with deep models. Our model is generative and contains components for learning both latent object properties and upper-level physical arguments. Not only can this model discover object properties, but it can also predict the physical behavior of objects. Ultimately, our system can discover physical properties of objects without explicit labels, simply by observing videos with the supervision of a physics interpreter. To train and test our system, we collected a dataset of 101 objects made of different materials and with a variety of masses and volumes. We started by collecting videos of these objects from multiple viewpoints in four scenarios: objects slide down an inclined surface and possibly collide with another object; objects fall onto surfaces made of different materials; objects splash in water; and objects hang on a spring. These seemingly straightforward setups require understanding multiple physical properties, e.g., material, mass, volume, density, coefficient of friction, and coefficient of restitution, as discussed in Section 3.1. We called this dataset Physics 101, highlighting that we are learning elementary physics, while also indicating the current object count.

Through experiments, we demonstrate that our framework develops some physics competency by observing videos.
In Section 5, we also show that our generative model can transfer learned physical knowledge from one scenario to another, and generalize to other tasks like predicting the outcome of a collision. Some dynamic scenarios we studied, e.g. objects sliding down an inclined surface and possibly hitting other objects, are similar to classic experiments conducted by Galileo Galilei hundreds of years ago [10], and to experimental conditions in modern developmental psychology [2].

Our contributions are three-fold: first, we propose novel scenarios for learning physical properties of objects, which facilitates comparisons with studies of infant vision; second, we build a dataset of videos with objects of various materials and appearances, specifically to study the physical properties of these objects; third, we propose a novel unsupervised deep learning framework for these tasks and conduct experiments to demonstrate its effectiveness and generalization ability.

2 Physical World

There exist highly involved physical processes in daily events in our physical world, even in simple scenarios like objects sliding down an inclined surface. As shown in Figure 1, we can divide all involved physical properties into two groups: the first is the intrinsic physical properties of objects like volume, material, and mass, many of which we cannot directly measure from the visual input; the second is the descriptive physical properties which characterize the scenario in the video, including but not limited to the velocity of objects, the distances that objects traveled, or whether objects float if they are thrown into water. The second group of parameters is observable and is determined by the first group, while both of them determine the content of videos.

Our goal is to build an architecture that can automatically discover those observable descriptive physical properties from unlabeled videos, and use them as supervision to further learn and infer unobservable latent physical properties. Our generative model can then apply learned knowledge of physical object properties to other tasks like predicting outcomes in the future.
To further develop this, in Section 3 we introduce our novel dataset for learning physical object properties. We then illustrate our model in Section 4.

3 Dataset - Physics 101

The AI and vision community has made much progress through its datasets, and there are datasets of objects, attributes, materials, and scene categories. Here, we introduce a new type of dataset, one that captures physical interactions of objects. The dataset consists of four different scenarios, for each of which plenty of intriguing questions may be asked. For example, in the ramp scenario, will the object on the ramp move, and if so and two objects collide, which of them will move next and how far?

3.1 Scenarios

We seek to learn physical properties of objects by observing videos. To this end, we build a dataset by recording videos of moving objects. We pick an introductory setup with four different scenarios, which are illustrated in Figures 3a and 4. We now introduce each scenario in detail.

Ramp  We put an object on an inclined surface, and the object may either slide down or stay static, due to gravity and friction. As studied earlier in [23], this seemingly straightforward scenario already involves understanding many physical object properties including material, coefficient of friction, mass, and velocity. Figure 4a analyzes the physics behind our setup. In this scenario, the observable descriptive physical properties are the velocities of the objects, and the distances both objects traveled. The latent properties directly involved are coefficient of friction and mass.

Spring  We hang objects on a spring, and gravity on the object will stretch the spring. Here the observable descriptive physical property is the length that the spring gets stretched, and the latent properties are the mass of the object and the elasticity of the spring.

Fall  We drop objects in the air, and they freely fall onto various surfaces. Here the observable descriptive physical properties are the bounce heights of the object, and the latent properties are the coefficients of restitution of the object and the surface.

Liquid  We drop objects into some liquid, and they may float or sink at various speeds. In this scenario, the observable descriptive physical property is the velocity of the sinking object (0 if it floats), and the latent properties are the densities of the object and the liquid.

Figure 2: Our model exploits the advancement of deep learning algorithms and discovers various physical properties of objects. We supervise all levels by automatically discovered observations from videos. These observations are provided to our physical world simulator, which transforms them into constraints on physical properties. During training and testing, our model has no labels of physical properties, in contrast to standard supervised learning approaches.

Figure 3: The objects and videos in Physics 101. (a) Our scenario and a snapshot of our dataset, Physics 101, of various objects at different stages. Our data are taken by four sensors (3 RGB and 1 depth). (b) Physics 101: this is the set of objects we used in our experiments. We vary object material, color, shape, and size, together with external conditions such as the slope of a surface or the stiffness of a spring. Videos recording the motions of these objects interacting with target objects will be used to train our algorithm. Object labels shown in the figure: plastic block, dough, foam, plastic toy, metal coin, wooden block, plastic ring, hollow rubber, rubber, cardboard, plastic doll, metal pole, wooden pole, hollow wood, porcelain, and target.

Figure 4: Illustrations of the scenarios in Physics 101. (a) The ramp scenario, in four stages: I. initial setup, II. sliding, III. at collision, IV. final result. Several physical properties will determine if object A will move, if it will reach object B, and how far each object will move. Here, N, R, and G indicate a normal force, a friction force, and a gravity force, respectively. (b) The spring scenario (I. initial, II. final). (c) The fall scenario (I. initial, II. at collision, III. bouncing). (d) The liquid scenario (I. floating, II. sunk).

3.2 Building Physics 101

The outcomes of various physical events depend on multiple factors of objects, such as materials (density and friction coefficient), sizes and shapes (volume), slopes of ramps (gravity), elasticities of springs, etc. We collect our dataset while varying all these conditions. Figure 3b shows the entire collection of our 101 objects, and the following are more details about our variations:

Material  Our 101 objects are made of 15 different materials: cardboard; dough; foam; hollow rubber; hollow wood; metal coin, pole, and block; plastic doll, ring, and toy; porcelain; rubber; wooden block; and wooden pole.

Appearance  For each material, we have 4 to 12 objects of different sizes and colors.

Slope (ramp)  We also vary the angle $\alpha$ between the inclined surface and the ground (to vary the gravity force). We set $\alpha = 10^\circ$ and $20^\circ$ for each object.

Target (ramp)  We have two different target objects: a cardboard box and a foam box. They are made of different materials, and thus have different friction coefficients and densities.

Spring  We use two springs with different stiffness.

Surface (fall)  We drop objects onto five different surfaces: foam, glass, metal, wooden table, and woolen rug. These materials have different coefficients of restitution.

We also measure the physical properties of these objects. We record the mass and volume of each object, which also determine density. For each setup, we record their actions for 3 to 10 trials. We measure multiple times because some external factors, e.g., orientations of objects and rough planes, may lead to different outcomes. Having more than one trial per condition increases the diversity of our dataset by making it cover more possible outcomes.

Finally, we record each trial from three different viewpoints: side, top-down, and upper-top.
For the first two, we take data with DSLR cameras, and for the upper-top view, we use a Kinect V2 to record both RGB and depth maps. We have 4,352 trials in total. Given we captured videos in three RGB maps and one depth map, there are 17,408 video clips altogether. These video clips constitute the Physics 101 dataset.

4 Method

In this section, we describe our approach that interprets the basic physical properties and infers the interaction of objects with the world. Rather than training each classifier/regressor with labels, e.g. material or volume, with full supervision, we are interested in discovering the properties under a unified system with minimal supervision. Our goals are two-fold. First, we want to be able to discover all involved physical properties like material and volume simply by observing motions of objects in unlabeled videos during training. Second, we also want to infer learned properties and also predict different physical interactions, e.g. how far will the object move if it moves at all; or whether the object will float if dropped into water. The setup is shown in Figure 3a.

Our model is shown in Figure 2. Our method is based on a convolutional neural network (CNN) [13], which consists of three components. The bottom component is a visual property discoverer, which aims to discover physical properties like material or volume which could at least partially be observed from visual input; the middle component is a physics interpreter, which explicitly encodes physical laws into the network structure and models latent physical properties like density and mass; the top component is a physical world simulator, which characterizes descriptive physical properties like distances that objects traveled, all of which we may directly observe from videos.

Our network corresponds to our physical world model introduced in Section 2. We would like to emphasize here that our model learns object properties from completely unlabeled data.
We do not provide any labels for physical properties like material, velocity, or volume; instead, our model automatically discovers observations from videos and uses them as supervision for the top physical world simulator, which in turn advises what the physics interpreter should discover.

4.1 Visual Property Discoverer

The bottom meta-layer of our architecture in Figure 2 is designed to discover and predict low-level properties of objects including material and volume, which can also be at least partially perceived from the visual input. These properties are the basic parts of predicting any derived physical properties at upper layers, e.g. density and mass.

In order to interpret any physical interaction of objects, we need to be able to first locate objects inside videos. We use a KLT point tracker [20] to track moving objects. We also compute a general background model for each scenario to locate foreground objects. Image patches of objects are then supplied to our visual property discoverer.

Material and volume  Material and volume are properties that can be estimated directly from image patches. Hence, we have LeNet [14] on top of image patches extracted by the tracker. Once again, rather than directly supervising each LeNet with their labels, we supervise them by automatically discovered observations which are provided to our physical world simulator. To be precise, we do not have any individual loss layer for LeNet components. Note that inferring volumes of objects from static images is an ambiguous problem. However, this problem is alleviated given our multi-view and RGB-D data.

4.2 Physics Interpreter

The second meta-layer of our model is designed to encode the physical laws. For instance, if we assume an object is homogeneous, then its density is determined by its material; the mass of an object should be the product of its density and volume. Based on material and volume, we expand a number of physical properties in this physics interpreter, which will later be used to connect to real-world observations. The following shows how we represent each physical property as a layer depicted in Figure 2:

Material  An $N_m$-dimensional vector, where $N_m$ is the number of different materials. The value of each dimension represents the confidence that the object belongs to that material. This is an output of our visual property discoverer.

Volume  A scalar representing the predicted volume of the object. This is an output of our visual property discoverer.

Coefficient of friction and density  Each is a scalar representing the predicted physical property based on the output of the material layer. Each output is the inner product of $N_m$ learned parameters and responses from the material layer.

Coefficient of restitution  An $N_m$-dimensional vector representing how much of the kinetic energy remains after a collision between the input object and other objects of various materials. The representation is a vector, not a scalar, as the coefficient of restitution is determined by the materials of both objects involved in the collision.

Mass  A scalar representing the predicted mass based on the outputs of the density layer and the volume layer. This layer is the product of the density and volume layers.

4.3 Physical World Simulator

Our physical world simulator connects the inferred physical properties to real-world observations. We have different observations for these scenarios. As explained in Section 3, we use velocities of objects and distances objects traveled as observations of the ramp scenario, the length that the spring is stretched as an observation of the spring scenario, the bounce height as an observation of the fall scenario, and the velocity at which the object sinks as an observation of the liquid scenario. All observations can be derived from the output of our tracker.

To connect those observations to the physical properties our model inferred, we employ physical laws. The physical laws we used in our model include:

Newton's law  $F = mg\sin\theta - \mu mg\cos\theta = ma$, or $(\sin\theta - \mu\cos\theta)g = a$, where $\theta$ is the angle between the inclined surface and the ground (denoted $\alpha$ in Section 3.2), $\mu$ is the friction coefficient, and $a$ is the acceleration (observation). This is used for the ramp case.

Conservation of momentum and energy  $C_R = (v_b - v_a)/(u_a - u_b)$, where $v_i$ is the velocity of object $i$ after collision, and $u_i$ is its velocity before collision. All $u_i$ and $v_i$ are observations, and this is also used for the ramp scenario.
Hooke's law  $F = kX$, where $X$ is the distance that the spring is extended (our observation), $k$ is the stiffness of the spring, and $F = G = mg$ is the gravity on the object. This is used for the spring scenario.

Bounce  $C_R = \sqrt{h/H}$, where $C_R$ is the coefficient of restitution, $h$ is the bounce height (observation), and $H$ is the drop height. This can be seen as another representation of energy and momentum conservation, and is used for the fall scenario.

Buoyancy  $dVg - d_w Vg = ma = dVa$, or $(d - d_w)g = da$, where $d$ is the density of the object, $d_w$ is the density of water (constant), and $a$ is the acceleration of the object in water (observation). This is used for the liquid scenario.

5 Experimental Results

In this section, we present experiments in various settings. We start with verifications of our model on learning physical properties. Later, we investigate its generalization ability on tasks like predicting outcomes given partial information, and transferring knowledge across different scenarios.

5.1 Setup

For learning physical properties from Physics 101, we study our algorithm in the following settings:

• Split by frame: for each trial of each object, we use 95% of the patches we get from tracking as training data, and the other 5% of the patches as test data.
• Split by trial: for each object, we use all patches in 95% of the trials we have as training data, and patches in the other 5% of the trials as test data.
• Split by object: we randomly choose 95% of the objects. We use them as training data and the others as test data.

Among these three settings, split by frame is the easiest, as for each patch in the test data the algorithm may find some very similar patch in the training data. Split by object is the most difficult setting, as it requires the model to generalize to objects that it has never seen before.

We consider training our model in different ways:

• Oracle training: we train our model with images of objects and their associated ground truth labels. We apply oracle training on those properties for which we have ground truth labels (material, mass, density, and volume).
• Standalone training: we train our model on data from one scenario.
  Automatically extracted observations serve as supervision.
• Joint training: we jointly train the entire network on all training data without any labels of physical properties. Our only supervision is the physical laws encoded in the top physical world simulator. Data from different scenarios supervise different layers in the network.

Oracle training is designed to test the ability of each component and serves as an upper bound of what the model may achieve. Our focus is on standalone and joint training, where our model learns from unlabeled videos directly.

We are also interested in understanding how our model performs at inferring some physical properties purely from depth maps. Therefore, we conduct some experiments where training and test data are depth maps only.

For all experiments, we use Torch [8] for implementation, MSE as our loss function, and SGD for optimization. We use a batch size of 32, and set the learning rate to 0.01.

5.2 Learning Physical Properties

Material perception: We start with the task of material classification. Table 1a shows the accuracy of the oracle models on material classification. We observe that they can achieve nearly perfect results in the easiest case, and are still significantly better than chance on the most difficult split-by-object setting. Both depth maps and RGB maps give good performance on this task with oracle training.
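To make the training setup of Section 5.1 concrete, the following is a minimal sketch of how an MSE loss on an observation can supervise the density layer of Section 4.2 through Hooke's law. It is an illustration in plain Python rather than the authors' Torch implementation. The function names, the assumption that the spring stiffness is known, and the restriction to the density weights alone are all hypothetical simplifications.

```python
G = 9.8  # gravitational acceleration in m/s^2


def predict_extension(material_conf, volume, density_weights, stiffness):
    """Forward pass for the spring scenario (illustrative only).

    material_conf   : list of N_m confidences from the material layer
    volume          : scalar from the volume layer
    density_weights : N_m learned parameters of the density layer
    stiffness       : spring constant k (assumed known in this sketch)
    """
    # Density layer: inner product of learned weights and material responses.
    density = sum(c * w for c, w in zip(material_conf, density_weights))
    mass = density * volume      # mass layer: density * volume
    return mass * G / stiffness  # Hooke's law: k X = m g


def sgd_step(batch, density_weights, stiffness, lr=0.01):
    """One SGD step on the batch MSE between predicted and observed extension.

    batch is a list of (material_conf, volume, observed_extension) tuples.
    Returns the updated weights and the MSE measured before the update.
    """
    grad = [0.0] * len(density_weights)
    mse = 0.0
    for conf, volume, observed in batch:
        err = predict_extension(conf, volume, density_weights, stiffness) - observed
        mse += err * err / len(batch)
        scale = 2.0 * err * volume * G / stiffness / len(batch)
        for j, c in enumerate(conf):
            grad[j] += scale * c  # d(err^2)/d(w_j), averaged over the batch
    new_weights = [w - lr * g for w, g in zip(density_weights, grad)]
    return new_weights, mse
```

Calling `sgd_step` repeatedly on a batch of 32 synthetic samples moves the density weights toward values that explain the observed extensions, without any density or mass label being supplied. In the full model the gradient would also flow into the visual property discoverer, which this sketch omits.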
In the standalone and joint training case, given that we have no labels on materials, it is not possible for the model to classify materials; instead, we expect it to cluster objects by their materials. To measure this, we perform K-means on the responses of the material layer on the test data, and use purity, a common measure for clustering, to measure whether our model indeed discovers clusters of materials automatically. As shown in Table 1a, the clustering results indicate that the system learns the material of objects to a certain extent.

Table 1a:

| Methods | Frame | Trial | Object |
|---|---|---|---|
| RGB (oracle) | 99.9 | 77.4 | 52.2 |
| Depth (oracle) | 92.6 | 62.5 | 35.7 |
| RGB (ramp) | 26.9 | 24.7 | 19.7 |
| RGB (spring) | 29.9 | 22.4 | 14.3 |
| RGB (fall) | 29.4 | 25.0 | 17.0 |
| RGB (liquid) | 22.2 | 15.4 | 12.6 |
| RGB (joint) | 35.5 | 28.7 | 25.7 |
| Depth (joint) | 38.3 | 26.9 | 22.4 |
| Uniform | 6.67 | 6.67 | 6.67 |

Table 1b:

| Methods | Mass: Frame | Mass: Trial | Mass: Object | Density: Frame | Density: Trial | Density: Object | Volume: Frame | Volume: Trial | Volume: Object |
|---|---|---|---|---|---|---|---|---|---|
| RGB (oracle) | 0.79 | 0.72 | 0.67 | 0.83 | 0.74 | 0.65 | 0.77 | 0.67 | 0.61 |
| Depth (oracle) | 0.67 | 0.51 | 0.33 | 0.69 | 0.61 | 0.49 | 0.40 | 0.41 | 0.33 |
| RGB (spring) | 0.40 | 0.35 | 0.20 | N/A | N/A | N/A | N/A | N/A | N/A |
| RGB (liquid) | N/A | N/A | N/A | 0.33 | 0.27 | 0.30 | N/A | N/A | N/A |
| RGB (joint) | 0.52 | 0.32 | 0.26 | 0.33 | 0.34 | 0.33 | 0.40 | 0.37 | 0.30 |
| Depth (joint) | 0.33 | 0.27 | 0.25 | 0.29 | 0.23 | 0.17 | 0.30 | 0.20 | 0.22 |
| Uniform | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Table 1: (a) Accuracies (%, for oracle) or clustering purities (%, for joint training) on material estimation. In the joint training case, as there is no supervision on the material layer, it is not necessary for the network to specifically map the responses in that layer to material labels, and we do not expect the numbers to be comparable with the oracle case. Our analysis is just to show that even in this case the network implicitly grasps some knowledge of object materials. (b) Correlation coefficients of our estimations and ground truth for mass, density, and volume.

Physical parameter estimation: We then test our systems, trained with or without oracles, on the task of physical property estimation. We use the Pearson product-moment correlation coefficient as the measure. Table 1b shows the results on estimating mass, density, and volume.
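The two evaluation measures named above, clustering purity and the Pearson correlation (which the paper computes on log-scaled outputs), can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the cluster assignments are taken as given (for example from any K-means implementation) rather than computed here.

```python
import math
from collections import Counter


def clustering_purity(cluster_ids, true_labels):
    """Share of samples whose label is the majority label of their cluster."""
    labels_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        labels_by_cluster.setdefault(cluster, []).append(label)
    # For each cluster, count how many members carry its most common label.
    majority_hits = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return majority_hits / len(true_labels)


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_scale_correlation(predicted, ground_truth):
    """Pearson correlation after taking logs; all inputs must be positive."""
    return pearson([math.log(p) for p in predicted],
                   [math.log(t) for t in ground_truth])
```

Note that `pearson` as written raises a `ZeroDivisionError` when one input is constant, since the coefficient is then undefined; the paper reports such uniform estimates as a correlation of 0 by convention.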
Notice that here we evaluate the outputs on a log scale to avoid unbalanced emphasis on objects with large volumes or masses.

We observe that with oracle training our model can learn all physical parameters well. For standalone and joint learning, our model also consistently achieves positive correlation. The uniform estimate is derived by calculating the average of ground truth results, e.g., how far objects indeed travel. As we calculate the correlation between predictions and ground truth, always predicting a constant value (the optimum uniform estimate) gives a correlation of 0.

5.3 Predicting Outcomes

We may apply our model to a variety of outcome prediction tasks for different scenarios. We consider three of them: how far an object will move after being hit by another object; how high an object will bounce after being dropped from a certain height; and whether an object will float in water. With estimated physical object properties, our model can answer these questions using physical laws.

Transferring Knowledge Across Multiple Scenarios  As some physical knowledge is shared across multiple scenarios, it is natural to evaluate how learned knowledge from one scenario may be applied to a novel one. Here we consider the case where the model is trained on all but the fall scenario. We then apply the model to the fall scenario for predicting how high an object bounces. Our intuition is that the coefficients of restitution learned from the ramp scenario can help the prediction to some extent.

Results  Table 2a shows outcome prediction results. We can see that our method works well, and can also transfer learned knowledge across multiple scenarios. We also compare our method with a CNN (LeNet) trained to directly predict outcomes without physical knowledge.
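As an illustration of how estimated properties can answer two of the outcome questions in Section 5.3, the sketch below applies the bounce and buoyancy laws of Section 4.3 in the forward direction. The function names are hypothetical, the water density of 1.0 g/cm^3 and g = 9.8 m/s^2 are standard values assumed here, and the paper does not state that its predictions are computed in exactly this form.

```python
def predict_bounce_height(restitution, drop_height):
    """Invert C_R = sqrt(h / H): the bounce height is h = C_R**2 * H."""
    return restitution ** 2 * drop_height


def sinking_acceleration(density, water_density=1.0, g=9.8):
    """Solve (d - d_w) * g = d * a for a.

    Densities share one unit (g/cm^3 by default). A negative result means
    the net force points upward, so the object rises instead of sinking.
    """
    return (density - water_density) * g / density


def will_float(density, water_density=1.0):
    """An object floats when it is less dense than the liquid."""
    return density < water_density
```

For example, an object with an estimated density of 0.6 g/cm^3 is predicted to float, while one at 2.2 g/cm^3 is predicted to sink; both densities are made-up values for illustration.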
Note that even our transfer model is better than the CNN in challenging cases (split by trial/object). Further, our framework can predict not only descriptive but also intrinsic physical properties like densities, which could be more useful in other domains like robotics.

Table 2a:

| Tasks | Methods | Frame | Trial | Object |
|---|---|---|---|---|
| Moving Distance | Ours (joint) | 0.65 | 0.42 | 0.33 |
| Moving Distance | CNN | 0.63 | 0.39 | 0.21 |
| Moving Distance | Uniform | 0 | 0 | 0 |
| Bounce Height | Ours (joint) | 0.35 | 0.31 | 0.23 |
| Bounce Height | Ours (transfer) | 0.32 | 0.27 | 0.16 |
| Bounce Height | CNN | 0.36 | 0.25 | 0.15 |
| Bounce Height | Uniform | 0 | 0 | 0 |
| Float | Ours (joint) | 0.94 | 0.87 | 0.84 |
| Float | CNN | 0.96 | 0.86 | 0.81 |
| Float | Uniform | 0.70 | 0.70 | 0.70 |

Table 2b:

| Material | Cardboard: H | Cardboard: M | Cardboard: B | Foam: H | Foam: M | Foam: B |
|---|---|---|---|---|---|---|
| cardboard | 28.8 | 40.7 | 97.0 | 15.0 | 77.2 | 84.0 |
| dough | 27.4 | 25.2 | 84.4 | 150.9 | 105.1 | 113.4 |
| h_wood | 35.7 | 19.4 | 108.9 | 81.0 | 35.0 | 21.4 |
| m_coin | 13.4 | 32.2 | 149.8 | 31.9 | 33.3 | 75.8 |
| m_pole | 272.9 | 257.6 | 280.0 | 91.4 | 188.7 | 184.0 |
| p_block | 29.8 | 82.1 | 97.6 | 46.9 | 57.2 | 35.0 |
| p_doll | 49.4 | 23.6 | 44.0 | 128.8 | 41.8 | 93.9 |
| p_toy | 30.1 | 41.9 | 121.2 | 33.3 | 9.5 | 70.6 |
| porcelain | 138.5 | 127.0 | 110.9 | 196.0 | 216.6 | 314.8 |
| w_block | 45.9 | 32.8 | 36.2 | 47.3 | 37.5 | 14.2 |
| w_pole | 78.9 | 88.0 | 138.9 | 58.7 | 89.8 | 74.3 |
| Mean | 68.2 | 70.1 | 115.4 | 80.1 | 81.1 | 98.3 |

Table 2: Results for various outcome prediction tasks. (a) Correlation coefficients on predicting the moving distance and the bounce height, and accuracies on predicting whether an object floats. (b) Mean squared errors in pixels of human predictions (H), model outputs (M), or a baseline (B) which always predicts the mean of all distances.

Figure 5: Heat maps of user predictions, model outputs (in orange), and ground truths (in white). Objects from left to right: dough, plastic block, and plastic doll.

Behavior Experiments  We would like to see how well our model does compared to a human. We thus conducted experiments on predicting the moving distance of an object after collision on Amazon Mechanical Turk. Specifically, among all objects that slide down, we select one object of each material and show AMT workers the videos of the object, but only up to the moment of collision.
We then ask workers to label where they believe the target object (either cardboard or foam) will be after the collision, i.e., how far the target will move. Before testing, each user is provided four full videos of other objects made of the same material, which contain complete collisions, so that users can infer the physical properties associated with the material and the target object in their mind. We tested 30 users per case.

Table 2b shows the mean squared errors in pixels of human predictions (H), model predictions (M), or a baseline which always predicts the mean of all distances (B). We can see that the performance of our model is close to that of humans on this task. Figure 5 shows the heat maps of user predictions, model outputs (orange), and ground truths (white).

6 Conclusion

In this paper, we discussed the possibility of learning physical object properties from videos. We proposed a novel dataset, Physics 101, which contains 17,408 videos of 101 objects in four scenarios, with detailed physical properties of these objects. We also proposed a novel model for learning physical properties of objects by encoding physical laws into deep neural nets. Experiments demonstrated that our method works well on Physics 101. We hope our paper could inspire future study on learning physical and other real-world knowledge from visual data.
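As a closing illustration of the physical laws listed in Section 4.3, the sketch below solves each law for the latent property it constrains, given the corresponding observation. It is a plain-Python restatement of those equations, not the network layers themselves. All function names are hypothetical, and g = 9.8 m/s^2 and a water density of 1.0 g/cm^3 are assumed standard values.

```python
import math

G = 9.8  # gravitational acceleration in m/s^2


def friction_from_ramp(acceleration, angle_deg):
    """Solve (sin(theta) - mu * cos(theta)) * g = a for mu."""
    theta = math.radians(angle_deg)
    return (G * math.sin(theta) - acceleration) / (G * math.cos(theta))


def restitution_from_collision(u_a, u_b, v_a, v_b):
    """C_R = (v_b - v_a) / (u_a - u_b), with u before and v after collision."""
    return (v_b - v_a) / (u_a - u_b)


def mass_from_spring(extension, stiffness):
    """Solve k * X = m * g for m (SI units: m, N/m, kg)."""
    return stiffness * extension / G


def restitution_from_bounce(bounce_height, drop_height):
    """C_R = sqrt(h / H)."""
    return math.sqrt(bounce_height / drop_height)


def density_from_sinking(acceleration, water_density=1.0):
    """Solve (d - d_w) * g = d * a for d; requires a < g."""
    return water_density * G / (G - acceleration)
```

In the model itself these relations are used as constraints during training rather than inverted in closed form, so the functions above are best read as a check on what each observation can reveal about the latent properties.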