Scene Graph Generation by Iterative Message Passing

Danfei Xu^1, Yuke Zhu^1, Christopher B. Choy^2, Li Fei-Fei^1
^1 Department of Computer Science, Stanford University
^2 Department of Electrical Engineering, Stanford University
{danfei, yukez, chrischoy, feifeili}@cs.stanford.edu
arXiv:1701.02426v1 [cs.CV] 10 Jan 2017

Abstract

Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such structured scene representation from an input image. The model solves the scene graph inference problem using standard RNNs and learns to iteratively improve its predictions via message passing. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods on generating scene graphs using the Visual Genome dataset and on inferring support relations with the NYU Depth v2 dataset.

Figure 1. Object detectors perceive a scene by attending to individual objects. As a result, even a perfect detector would produce similar outputs on two semantically distinct images (first row). We propose a scene graph generation model that takes an image as input, and generates a visually-grounded scene graph (second row, right) that captures the objects in the image (blue nodes) and their pairwise relationships (red nodes).

1. Introduction

Today's state-of-the-art perceptual models [15, 32] have mostly tackled detecting and recognizing individual objects in isolation. However, understanding a visual scene often goes beyond recognizing individual objects. Take a look at the two images in Fig. 1. Even a perfect object detector would struggle to perceive the subtle difference between a man feeding a horse and a man standing by a horse. The rich semantic relationships between these objects have been largely untapped by these models. As indicated by a series of previous works [26, 34, 41], one crucial step towards a deeper understanding of visual scenes is building a structured representation that captures objects and their semantic relationships. Such a representation not only offers contextual cues for fundamental recognition tasks [27, 29, 38, 39] but also provides value in a larger variety of high-level visual tasks [18, 44, 40].

The recent success of deep learning-based recognition models [15, 21, 36] has resurged interest in examining the detailed structures of a visual scene, especially in the form of object relationships [5, 20, 26, 33]. The scene graph, proposed by Johnson et al. [18], offers a platform to explicitly model objects and their relationships. In short, a scene graph is a visually-grounded graph over the object instances in an image, where the edges depict their pairwise relationships (see example in Fig. 1).

The value of the scene graph representation has been proven in a wide range of visual tasks, such as semantic image retrieval [18], 3D scene synthesis [4], and visual question answering [37]. Anderson et al. recently proposed SPICE [1] as an enhanced automated caption evaluation metric defined over scene graphs. However, these models that use scene graphs either rely on ground-truth annotations [18], synthetic images [37], or extract a scene graph from the text domain [1, 4]. To truly take advantage of such rich structure, it is crucial to devise a model that automatically generates scene graphs from images.
In this work, we address the problem of scene graph generation, where the goal is to generate a visually-grounded scene graph from an image. In a generated scene graph, an object instance is characterized by a bounding box with an object category label, and a relationship is characterized by a directed edge between two bounding boxes (i.e., object and subject) with a relationship predicate (red nodes in Fig. 1). The major challenge of generating scene graphs is reasoning about relationships. Much effort has been expended on localizing and recognizing semantic relationships in images [6, 8, 26, 34, 39]. Most methods have focused on making local predictions of object relationships [26, 34], which essentially simplifies the scene graph generation problem into independently predicting relationships between pairs of objects. However, by making local predictions these models ignore surrounding context, whereas joint reasoning with contextual information can often resolve the ambiguity caused by local predictions in isolation.

To capture this intuition, we propose a novel end-to-end model that learns to generate image-grounded scene graphs (Fig. 2). The model takes an image as input and outputs a scene graph that consists of object categories, their bounding boxes, and semantic relationships between pairs of objects. Our major contribution is that instead of inferring each component of a scene graph in isolation, the model passes messages containing contextual information between a pair of bipartite sub-graphs of the scene graph, and iteratively refines its predictions using RNNs. We evaluate our model on the Visual Genome scene graph dataset [20], which contains human-annotated scene graphs on 108,077 images. On average, each image is annotated with 13.5 objects and 15 pairwise object relationships. We show that relationship prediction in scene graphs can be significantly improved by our model. Furthermore, we also apply our model to the NYU Depth v2 dataset [28], establishing new state-of-the-art results in reasoning about spatial relations, such as horizontal and vertical supports.

In summary, we propose an end-to-end model that generates visually-grounded scene graphs from images. The model uses a novel inference formulation that iteratively refines its prediction by passing contextual messages along the topological structure of a scene graph. We demonstrate its use for generating semantic scene graphs from the Visual Genome scene graph dataset as well as for predicting support relations using the NYU Depth v2 dataset.

Figure 2. An overview of our model architecture. Given an image as input, the model first produces a set of object proposals using a Region Proposal Network (RPN) [32], and then passes the extracted features of the object regions to our novel graph inference module. The output of the model is a scene graph [18], which contains a set of localized objects, categories of each object, and relationship types between each pair of objects.

2. Related Work

Scene understanding and relationship prediction. Visual scene understanding often harnesses the statistical patterns of object co-occurrence [11, 22, 30, 35] as well as spatial layout [2, 9]. A series of contextual models based on surrounding pixels and regions have also been developed for perceptual tasks [3, 13, 25, 27]. Recent works [6, 31] exploit more complex structures for relationship prediction. However, these works focus on image-level predictions without detailed visual grounding. Physical relationships, such as support and stability, have been studied in [17, 28, 42]. Lu et al. [26] directly tackled semantic relationship detection by combining visual inputs with language priors to cope with the long-tail distribution of real-world relationships. However, their method predicts each relationship independently. We show that our model outperforms theirs with joint inference.

Visual scene representation. One of the most popular ways of representing a visual scene is through text descriptions [14, 34, 44]. Although text-based representation has been shown to be helpful for scene classification and retrieval, its power is often limited by ambiguity and lack of expressiveness. In comparison, scene graphs [18] offer explicit grounding of visual concepts, avoiding referential uncertainty in text-based representation.
Scene graphs have been used in many downstream tasks such as image retrieval [18], 3D scene synthesis [4] and understanding [10], visual question answering [37], and automatic caption evaluation [1]. However, previous work on scene graphs shied away from the graph generation problem by either using ground-truth annotations [18, 37] or extracting the graphs from other modalities [1, 4, 10]. Our work addresses the problem of generating scene graphs directly from images.

Graph inference. Conditional Random Fields (CRF) have been used extensively in graph inference. Johnson et al. used a CRF to infer scene graph grounding distributions for image retrieval [18]. Yatskar et al. [40] proposed situation-driven object and action prediction using a deep CRF model. Our work is closely related to CRF-as-RNN [43] and Graph-LSTM [23] in that we also formulate the graph inference problem using an RNN-based model. A key difference is that they focus on node inference while treating edges as pairwise constraints, whereas we enable edge predictions using a novel primal-dual graph inference scheme. We also share the same spirit as Structural RNN [16]. A crucial distinction is that our model iteratively refines its predictions through message passing, whereas the Structural RNN model only makes one-time predictions along the temporal dimension, and thus cannot refine its past predictions.

3. Scene Graph Generation

A scene graph, as defined by Johnson et al. [18], is a structured representation of an image, where nodes in a scene graph correspond to object bounding boxes with their object categories, and edges correspond to the pairwise relationships between objects. The task of scene graph generation is to generate a visually-grounded scene graph that most accurately correlates with an image. Intuitively, individual predictions of objects and relationships can benefit from their surrounding context.
For instance, knowing that "a horse is on a grass field" is likely to increase the chance of detecting a person and predicting the relationship "man riding horse". To capture this intuition, we propose a joint inference framework that enables contextual information to propagate through the scene graph topology via a message passing scheme.

However, inference on a densely connected graph can be very expensive. As shown in previous work [19, 43], dense graph inference can be approximated by mean field in Conditional Random Fields (CRF). Our approach is inspired by Zheng et al. [43], which designs fully differentiable layers to enable end-to-end learning with recurrent neural networks (RNN). Yet their model relies on purpose-built RNN layers. To achieve greater flexibility in a more principled training framework, we use a generic RNN unit instead, in particular a Gated Recurrent Unit (GRU) [7]. At each iteration, each GRU takes its previous hidden state and an incoming message as input, and produces a new hidden state as output. Each node and edge in the scene graph maintains its internal state in its corresponding GRU unit, where all nodes share one set of GRU weights (node GRUs) and all edges share another set of GRU weights (edge GRUs). This setup allows the model to pass messages (an aggregation of GRU hidden states) among the GRU units along the scene graph topology. We also propose a message pooling function that learns to dynamically aggregate the hidden states of the GRUs into messages.

We further observe that the unique structure of scene graphs forms a bipartite structure of message passing channels. Since messages only pass along the topological structure of a scene graph, the set of edge GRUs and the set of node GRUs form a bipartite graph, where no message is passed inside each set. Inspired by this observation, we formulate two disjoint sub-graphs that are essentially the dual graph to each other. The primal graph defines channels for messages to pass from edge GRUs to node GRUs. The dual graph defines channels for messages to pass from node GRUs to edge GRUs. With such a primal-dual formulation, we can therefore improve inference efficiency by iteratively passing messages between these sub-graphs instead of through a densely connected graph. Fig. 3 gives an overview of our model.

Figure 3. An illustration of our model architecture (Sec. 3). The model first extracts visual features of nodes and edges from a set of object proposals, and edge GRUs and node GRUs then take the visual features as initial input and produce a set of hidden states (a). A node message pooling function then computes messages that are passed to the node GRUs in the next iteration from the hidden states. Similarly, an edge message pooling function computes messages and feeds them to the edge GRUs (b). The weighted-sum symbol denotes a learnt weighted sum. The model iteratively updates the hidden states of the GRUs (c). At the last iteration step, the hidden states of the GRUs are used to predict object categories, bounding box offsets, and relationship types (d).

3.1. Problem Formulation

We first lay out the mathematical formulation of our scene graph generation problem. To generate a visually grounded scene graph, we need to obtain an initial set of object bounding boxes. These bounding boxes can be either from ground-truth human annotation or algorithmically generated. In practice, we use the Region Proposal Network (RPN) [32] to automatically generate a set of object bounding box proposals B_I from an image I as the base input to the inference procedure (Fig. 3(a)).

For each object box proposal, we need to infer two types of object-centric variables: 1) an object class label, and 2) four bounding box offsets relative to the proposal box coordinates, which are used for refining the proposal boxes. In addition, we need to infer a relationship-centric variable between every pair of proposal boxes, which denotes the predicate type of the relationship between the corresponding object pair. Given a set of object classes C (including background) and a set of relationship types R (including the none relationship), we denote the set of all variables as x = {x_i^cls, x_i^bbox, x_{i->j} | i = 1...n, j = 1...n, i != j}, where n is the number of proposal boxes, x_i^cls is the class label of the i-th proposal box, x_i^bbox is the 4-d bounding box offsets relative to the i-th proposal box coordinates, and x_{i->j} is the relationship predicate between the i-th and the j-th proposal boxes.

At a high level, the inference task is to classify objects, predict their bounding box offsets, and classify relationship predicates between each pair of objects. Formally, we formulate the scene graph generation problem as finding the optimal x* = argmax_x Pr(x | I, B_I) that maximizes the following probability function given the image I and box proposals B_I:

    Pr(x | I, B_I) = \prod_{i \in V} \prod_{j \neq i} Pr(x_i^{cls}, x_i^{bbox}, x_{i \to j} | I, B_I)    (1)

In the next subsection, we introduce a way to approximate the inference procedure using an iterative message passing scheme modeled with Gated Recurrent Units [7].
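To make the variable layout concrete, the following is a minimal plain-Python sketch (not the authors' code) of how the candidate graph of Eq. 1 can be enumerated from n proposal boxes and how the objective decomposes in log space. The names candidate_graph and pair_log_prob are illustrative placeholders.

```python
import itertools

def candidate_graph(n):
    """Variables of Eq. 1 for n proposal boxes: one class label and one
    4-d bbox offset per box, plus one predicate variable x_{i->j} for
    every ordered pair (i, j) with i != j."""
    nodes = list(range(n))
    edges = list(itertools.permutations(nodes, 2))
    return nodes, edges

def joint_log_prob(pair_log_prob, edges):
    """Eq. 1 in log space: the joint probability factorizes over ordered
    object pairs, so the log-probability is a sum of per-pair terms
    log Pr(x_i^cls, x_i^bbox, x_{i->j} | I, B_I)."""
    return sum(pair_log_prob[(i, j)] for (i, j) in edges)
```

Note that for the 256 training proposals used in Sec. 3.4 this already yields 256 x 255 candidate edges, which is why training sub-samples 512 edges per image.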
3.2. Inference using Recurrent Neural Networks

We use mean field to perform approximate inference. We denote the probability of each variable x as Q(x | .), and assume that the probability only depends on the current state of each node and edge at each iteration. In contrast to Zheng et al. [43], we use a generic RNN module to compute the hidden states. In particular, we choose Gated Recurrent Units [7] due to their simplicity and effectiveness. We use the hidden state of the corresponding GRU, a high-dimensional vector, to represent the current state of each node and each edge. As all the nodes (edges) share the same update rule, we share one set of parameters among all the node GRUs and another set of parameters among all the edge GRUs (Fig. 3). We denote the current hidden state of node i as h_i and the current hidden state of edge i -> j as h_{i->j}. The mean field distribution can then be formulated as

    Q(x | I, B_I) = \prod_{i=1}^{n} Q(x_i^{cls}, x_i^{bbox} | h_i) \, Q(h_i | f_i^{v}) \prod_{j \neq i} Q(x_{i \to j} | h_{i \to j}) \, Q(h_{i \to j} | f_{i \to j}^{e})    (2)

where f_i^v is the visual feature of the i-th node, and f_{i->j}^e is the visual feature of the edge from the i-th node to the j-th node. In the first iteration, the GRU units take the visual features f^v and f^e as input (Fig. 3(a)). We use the visual feature of the proposal box as the visual feature f_i^v for the i-th node, and the visual feature of the union box over the proposal boxes b_i, b_j as the visual feature f_{i->j}^e for the edge i -> j. These visual features are extracted from the image by a ROI-pooling layer [12]. In later iterations, the inputs are the aggregated messages from other GRU units of the previous step. We describe how the messages are aggregated and passed in the next subsection.
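The iterative update described above can be summarized by the following PyTorch-style sketch. It assumes 512-dimensional hidden states as in Sec. 3.4; pool_node and pool_edge stand in for the message pooling functions introduced in Sec. 3.3, and all names are illustrative rather than taken from the authors' released implementation.

```python
import torch
import torch.nn as nn

class IterativeMessagePassing(nn.Module):
    """Sketch of Sec. 3.2/3.3: shared node/edge GRUs updated for a fixed
    number of message passing iterations."""
    def __init__(self, dim=512, num_iters=2):
        super().__init__()
        self.node_gru = nn.GRUCell(dim, dim)   # one set of weights shared by all nodes
        self.edge_gru = nn.GRUCell(dim, dim)   # another set shared by all edges
        self.num_iters = num_iters

    def forward(self, node_feats, edge_feats, edges, pool_node, pool_edge):
        # node_feats: (n, dim) ROI features of the proposal boxes
        # edge_feats: (m, dim) ROI features of the union boxes, one per ordered pair
        # edges: list of (i, j) giving the subject/object node index of each edge
        h_node = self.node_gru(node_feats)      # first iteration: visual features as input
        h_edge = self.edge_gru(edge_feats)
        for _ in range(self.num_iters):
            m_node = pool_node(h_node, h_edge, edges)   # messages from edge GRUs to nodes
            m_edge = pool_edge(h_node, h_edge, edges)   # messages from node GRUs to edges
            h_node = self.node_gru(m_node, h_node)      # primal (node-centric) update
            h_edge = self.edge_gru(m_edge, h_edge)      # dual (edge-centric) update
        return h_node, h_edge
```

At the final iteration, h_node and h_edge are fed to the classification and bounding-box regression layers described in Sec. 3.4.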
3.3. Primal Dual Update and Message Pooling

Sec. 3.2 offers a generic formulation for solving the graph inference problem using RNNs. However, we observe that we can further improve inference efficiency by leveraging the unique bipartite structure of a scene graph. In the scene graph topology, the neighbors of the edge GRUs are node GRUs, and vice versa. Passing messages along this structure forms two disjoint sub-graphs that are the dual graph to each other. Specifically, we have a node-centric primal graph, in which each node GRU gets messages from its inbound and outbound edge GRUs. In the edge-centric dual graph, each edge GRU gets messages from its subject node GRU and its object node GRU (Fig. 3(b)). We can therefore improve inference efficiency by iteratively passing messages between these two sub-graphs instead of through a densely connected graph (Fig. 3(c)).

As each GRU receives multiple incoming messages, we need an aggregation function that can fuse information from all messages into a meaningful representation. A naive approach would be standard pooling methods such as average- or max-pooling. However, we found that it is more effective to learn adaptive weights that can modulate the influences of incoming messages and only keep the relevant information. We introduce a message pooling function that computes the weight factors for each incoming message and fuses the messages using a weighted sum. We provide an empirical analysis of different message pooling functions in Sec. 4.

Formally, given the current GRU hidden states of nodes and edges, h_i and h_{i->j}, we denote the message used to update the i-th node as m_i, computed as a function of its own hidden state h_i and the hidden states of its outbound edge GRUs h_{i->j} and inbound edge GRUs h_{j->i}. Similarly, we denote the message used to update the edge from the i-th node to the j-th node as m_{i->j}, computed as a function of its own hidden state h_{i->j} and the hidden states of its subject node GRU h_i and object node GRU h_j. To be more specific, m_i and m_{i->j} are computed by the following two adaptively weighted message pooling functions:

    m_i = \sum_{j: i \to j} \sigma(v_1^T [h_i, h_{i \to j}]) \, h_{i \to j} + \sum_{j: j \to i} \sigma(v_2^T [h_i, h_{j \to i}]) \, h_{j \to i}    (3)

    m_{i \to j} = \sigma(w_1^T [h_i, h_{i \to j}]) \, h_i + \sigma(w_2^T [h_j, h_{i \to j}]) \, h_j    (4)

where [.] denotes a concatenation of vectors, and sigma denotes a sigmoid function. w_1, w_2 and v_1, v_2 are learnable parameters. These two equations describe the primal-dual update rules, as shown in Fig. 3(b).
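Below is a minimal sketch of the two adaptively weighted pooling functions in Eqs. 3 and 4, written so it could plug into the loop above as pool_node and pool_edge. The class name and the explicit Python loops are illustrative only; an efficient implementation would batch these with scatter or index_add operations, and the bias-free linear layers simply hold the learnable vectors v_1, v_2, w_1, w_2.

```python
import torch
import torch.nn as nn

class MessagePooling(nn.Module):
    """Adaptively weighted message pooling of Eqs. 3 and 4 (sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.v1 = nn.Linear(2 * dim, 1, bias=False)  # outbound edge -> node weight
        self.v2 = nn.Linear(2 * dim, 1, bias=False)  # inbound edge -> node weight
        self.w1 = nn.Linear(2 * dim, 1, bias=False)  # subject node -> edge weight
        self.w2 = nn.Linear(2 * dim, 1, bias=False)  # object node -> edge weight

    def pool_node(self, h_node, h_edge, edges):
        # Eq. 3: weighted sum of outbound and inbound edge hidden states per node.
        m = torch.zeros_like(h_node)
        for e, (i, j) in enumerate(edges):
            m[i] += torch.sigmoid(self.v1(torch.cat([h_node[i], h_edge[e]]))) * h_edge[e]
            m[j] += torch.sigmoid(self.v2(torch.cat([h_node[j], h_edge[e]]))) * h_edge[e]
        return m

    def pool_edge(self, h_node, h_edge, edges):
        # Eq. 4: weighted sum of the subject and object node hidden states per edge.
        m = torch.zeros_like(h_edge)
        for e, (i, j) in enumerate(edges):
            m[e] = (torch.sigmoid(self.w1(torch.cat([h_node[i], h_edge[e]]))) * h_node[i]
                    + torch.sigmoid(self.w2(torch.cat([h_node[j], h_edge[e]]))) * h_node[j])
        return m
```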
3.4. Implementation Details

Our final output layers follow closely the Faster R-CNN setup [32]. We use a softmax layer to produce the final scores for the object class as well as the relationship predicate. We use an FC layer to regress to the bounding box offsets for each object class separately. We use cross-entropy to compute the loss on the object class and the relationship predicate, and an L1 function to compute the loss of the bounding box offsets.

We use an MS COCO-pretrained VGG-16 network to extract visual features from images. We freeze the weights of all convolution layers, and only finetune the fully connected layers, including the GRUs. The node GRUs and the edge GRUs both have 512-dimensional input and output. During training, we first use NMS to select at most 2,000 boxes from all proposed boxes B_I, and then randomly select 256 boxes as the object proposals. Due to the quadratic number of edges, we randomly sub-sample 512 edges for each image at training time. Edges without predicate labels are assigned to the "none" class. At test time, we use NMS to select at most 50 boxes from the object proposals with an IoU threshold of 0.3, and we make predictions on all edges except the self-connections.

4. Experiments

We evaluate our model on generating scene graphs from images. We compare our model against a recently proposed model on visual relationship prediction [26]. Our goal is to analyze our model on datasets with both sparse and dense relationship annotations. We use the Visual Genome scene graph dataset [20] in our main experiment. We also evaluate our model on the support relation inference task in the NYU Depth v2 dataset. The key difference between these two datasets is that scene graph annotation is very sparse in Visual Genome: among all possible pairings of objects, only 5% are labeled with a relationship predicate. The NYU Depth v2 dataset, on the other hand, exhaustively annotates the support of every labeled object. Our experiments show that our model outperforms the baseline model [26], and can generalize to other types of relationships, in particular support relations [28], without any architecture change.

Visual Genome. The Visual Genome dataset [20] is a superset of the visual relationship dataset used in [26]. It contains 108,077 images annotated with on average 13.5 objects and 15 relationships per image. The object categories and relationship predicates are annotated with an open vocabulary. There are 75,729 unique object categories and 40,480 unique relationship predicates. As the annotations follow an extremely long-tail distribution, we use the most frequent 150 object categories and 50 predicates for evaluation. As a result, each image has a scene graph of around 12 objects and 7 relationships. We use 70% of the images for training and the remaining 30% for testing.

NYU Depth v2. We also evaluate our model on the support relation graphs from the NYU Depth v2 dataset [28]. The dataset contains 1,449 RGBD images captured in 27 indoor scenes. Each image is annotated with instance segmentation, region class labels, and support relations between regions. We use the standard split, with 795 images used for training and 654 images for testing.

4.1. Semantic Scene Graph Generation

Setup. Given an image, the scene graph generation task is to localize a set of objects, classify their category labels, and predict relationships between each pair of the objects. We evaluate our model on the Visual Genome dataset. We analyze our model in three setups below.

1. The predicate classification (PREDCLS) task is to predict the predicates of all pairwise relationships of a set of localized objects. This task examines the model's performance on predicate classification in isolation from other factors.

2. The scene graph classification (SGCLS) task is to predict the predicate as well as the object categories of the subject and the object in every pairwise relationship, given a set of localized objects.

3. The scene graph generation (SGGEN) task is to simultaneously detect a set of objects and predict the predicate between each pair of the detected objects. An object is considered to be correctly detected if it has at least 0.5 IoU overlap with the ground-truth box.

We adopted the image-wise recall evaluation metrics, R@50 and R@100, used by Lu et al. [26] for all three setups. The R@k metric measures the fraction of ground-truth relationship triplets (subject-predicate-object) that appear among the top k most confident triplet predictions in an image. The choice of this metric is, as explained in [26], due to the sparsity of the relationship annotations in Visual Genome: metrics like mAP would falsely penalize positive predictions on unlabeled relationships. We also report per-type recall@5 of classifying each individual predicate. This metric measures the fraction of the time the correct predicate is among the top 5 most confident predictions of each labeled relationship triplet. As shown in Table 2, many predicates have very similar semantic meanings, for example, on vs. over and hanging from vs. attached to. The less frequent predicates would be overshadowed by the more frequent ones during training. We use the recall metric to alleviate such an effect.
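For reference, here is a minimal sketch of the image-wise R@k metric described above, under assumed data structures: pred_triplets is a list of (subject, predicate, object, score) tuples for one image, gt_triplets holds the ground-truth triplets, and match stands in for the matching rule (exact labels for PREDCLS/SGCLS, plus the 0.5 IoU box test for SGGEN). A dataset-level score would then aggregate this value over the test images.

```python
def recall_at_k(pred_triplets, gt_triplets, k, match):
    """Image-wise R@k: the fraction of ground-truth (subject, predicate,
    object) triplets that are covered by the top-k most confident
    predicted triplets of the image."""
    top_k = sorted(pred_triplets, key=lambda t: t[3], reverse=True)[:k]
    hits = sum(1 for gt in gt_triplets if any(match(p, gt) for p in top_k))
    return hits / max(len(gt_triplets), 1)
```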
4.1.1 Network Models

We evaluate our final model and a number of baseline models. One of the key components in our primal-dual formulation is the pair of message pooling functions that use a learnt weighted sum to aggregate the hidden states of nodes and edges into messages (see Eq. 3 and Eq. 4). In order to demonstrate its effectiveness, we evaluate variants of our model with standard pooling methods. The first uses average-pooling (avg. pool) instead of the learnt weighted sum to aggregate the hidden states. The second is similar to the first, but uses max-pooling (max pool). We also evaluate our model against a relationship detection model proposed by Lu et al. [26]. Their model consists of two components: a vision module that makes predictions from images, and a language module that captures language priors. We compare with their vision module, which uses the same inputs as ours; their language module is orthogonal to our model, and can be added independently. Note that this model is equivalent to our final model without any message passing.

4.1.2 Results

Table 1. Evaluation results of the scene graph generation task on the Visual Genome dataset [20]. We compare a few variations of our model against the visual relationship detection module proposed by Lu et al. [26] (Sec. 4.1.1).

                      [26]    avg. pool   max pool   final
PREDCLS   R@50       26.67       29.84      31.83    40.10
          R@100      33.32       38.01      41.19    49.67
SGCLS     R@50       10.11       15.24      15.61    19.11
          R@100      12.64       17.75      18.21    21.59
SGGEN     R@50        0.08        2.63       2.67     3.10
          R@100       0.14        3.21       3.17     3.63

Table 1 shows the performance of our model and the baselines. The baseline model [26] makes individual predictions on objects and relationships in isolation.
The only information that the predicate classifier takes is a bounding box covering the union of the two objects, making it likely to confuse the subject and the object. We showcase some of these errors later in a qualitative analysis. Our final model, with a learnt weighted sum over the connecting hidden states, greatly outperforms the baseline model (a 16% gain on predicate classification with the R@100 metric) and the model variants. This shows that learning to modulate the information from other hidden states enables the network to extract more relevant information and yields superior performance.

Fig. 4 shows the predicate classification performance of our models trained with different numbers of iterations. The performance of our final model peaks when trained with two iterations, and gradually degrades afterwards. We hypothesize that this is because, as the number of iterations increases, noisy messages start to permeate through the graph and hamper the final prediction. The max-pooling and average-pooling models, on the other hand, barely improve after the first iteration, showing ineffective message passing due to these naive aggregation methods.

Figure 4. Predicate classification performance (R@100) of our models trained with different numbers of iterations. Note that the baseline model is equivalent to our model with zero iterations, as it feeds the node and edge visual features directly to the classifiers.

Finally, Table 2 shows results of per-type predicate recall. Both the baseline model and our final model perform well in predicting frequent predicates. However, the gap between the models expands for less frequent predicates. This is because our model uses contextual information to cope with the uneven distribution in the relationship annotations, whereas the baseline model suffers more from the skewed distribution by making predictions in isolation.

Table 2. Predicate classification recall. We compare our final model (trained with two iterations) with Lu et al. [26]. The 20 most frequent predicate types (sorted by frequency) are shown. The evaluation metric is recall@5.

predicate     [26]    ours      predicate       [26]    ours
on           99.83   99.17      under          25.32   56.93
has          97.72   96.47      sitting on     49.48   57.01
in           73.56   88.77      standing on    51.43   67.01
of           88.59   96.18      in front of    31.52   64.63
wearing      98.32   98.01      attached to    11.81   27.43
near         87.46   95.14      at             57.73   70.00
with         29.42   88.00      hanging from    0.00    0.00
above        47.48   70.94      over            4.17    0.69
holding      55.67   82.80      for             5.61   11.21
behind       76.43   84.12      riding         82.03   91.18

4.1.3 Qualitative Results

Figure 5. Sample predictions from the baseline model and our final model trained with different numbers of message passing iterations. The models take images and object bounding boxes as input, and produce object class labels (blue boxes) and relationship predicates between each pair of objects (orange boxes). In order to keep the visualization interpretable, we only show the relationship (edge) predictions for the pairs of objects (nodes) that have ground-truth relationship annotations.

Fig. 5 shows qualitative results that compare our final model trained with different numbers of iterations against the baseline model. The results show that the baseline model tends to confuse the subject and the object in a relationship. For example, it predicts (umbrella-holding-man) in (b) and (counter-on-vase) in (c). Our final model trained with one iteration is able to resolve some of the ambiguity in the object-subject direction, for example predicting (umbrella-on-woman) and (head-of-man) in (b), but it still predicts cyclic relationships like (vase-in-flower-in-vase). Finally, the final model trained with two iterations is able to make semantically correct predictions, e.g., (umbrella-behind-man), and resolves the cyclic relationships, e.g., (vase-with-flower-in-vase). Another observation is that our model often predicts predicates that are semantically more accurate than the ground-truth annotations, e.g., our model predicts (man-wearing-hat) in (a) and (table-under-vase) in (c), whereas the ground-truth labels are (man-has-hat) and (table-has-vase), respectively. The bottom part of Fig. 5 showcases more qualitative results.
4.2. Support Relation Prediction

We then evaluate on the NYU Depth v2 dataset [28] with densely labeled support relations. We show that our model can generalize to other types of relationships and is effective on both sparsely and densely labeled relationships.

Setup. The NYU Depth v2 dataset contains three types of support relationships: an object can be supported by an object from behind, by an object from below, or supported by a hidden object. Each object is also labeled with one of four structure classes: {floor, structure, furniture, prop}. We define the support graph generation task as predicting both the support relation type between objects and the structure class of each object. We take the smallest bounding box that encloses an object's segmentation mask as its object region, and we assume ground-truth object locations in this task.

We compare our final model with two previous models [28, 24] on the support graph generation task. Following the metric used in previous work, we report two types of support relation accuracy [28]: type-aware and type-agnostic. We also report the performance with the R@50 and R@100 measurements of the predicate classification task introduced in Sec. 4.1. Note that both [28] and [24] use RGBD images, whereas our model uses only RGB images.

Table 3. Evaluation results of the support graph generation task. t-ag stands for type-agnostic and t-aw stands for type-aware.

                          Support Accuracy       PREDCLS
                           t-ag     t-aw      R@50    R@100
Silberman et al. [28]      75.9     72.6        -        -
Liao et al. [24]           88.4     82.1        -        -
Baseline [26]              87.7     85.3      34.1     50.3
Final model (ours)         91.2     89.0      41.8     55.5

Results. Our model outperforms previous work, achieving new state-of-the-art performance using only RGB images. Our results show that having contextual information further improves support relation prediction, even compared to purpose-built models [24, 28] that use RGBD images. Fig. 6 shows some sample predictions using our final model. Incorrect predictions typically occur for ambiguous supports; for example, books in shelves can be mistaken as being supported from behind (row 1, column 2). Another failure mode is due to geometric structures that have weak visual features. As shown in row 2, column 1, the ceiling at the top left corner of the image is predicted as supported from behind instead of supported from below by the wall, but the boundary between the ceiling and the wall is nearly invisible. Such visual uncertainty may be resolved by having additional depth information.

Figure 6. Sample support relation predictions from our model on the NYU Depth v2 dataset [28]. Arrows denote predicted support from below or from behind; red arrows are incorrect predictions. We also color code structure classes: ground is in blue, structure is in green, furniture is in yellow, prop is in red, and purple indicates a missing structure class. Note that the segmentation masks are only shown for visualization purposes; our model uses object bounding boxes as input instead of these masks.

5. Conclusions

We addressed the problem of automatically generating a visually grounded scene graph from an image with a novel end-to-end model. Our model performs iterative message passing between the primal and dual sub-graphs along the topological structure of a scene graph. In this way, it improves the quality of node and edge predictions by incorporating informative contextual cues. Our model can be considered a more generic framework for the graph generation problem. In this work, we have demonstrated its effectiveness in predicting Visual Genome scene graphs as well as support relations in indoor scenes. A possible future direction would be to explore its capability in other structured prediction problems in vision and other problem domains.

Acknowledgements. We would like to thank Ranjay Krishna, Judy Hoffman, JunYoung Gwak, and the anonymous reviewers for useful comments. This research is partially supported by a Yahoo Labs Macro award and an ONR MURI award.

References

[1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.
[2] R. Baur, A. Efros, and M. Hebert. Statistics of 3d object locations in images. 2008.
[3] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143, 2015.
[4] A. X. Chang, M. Savva, and C. D. Manning. Learning spatial knowledge for text to 3d scene generation. 2014.
[5] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng. HICO: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.
[6] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[7] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[8] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for static human-object interactions. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2010.
[9] C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. International Journal of Computer Vision, 95(1), 2011.
[10] M. Fisher, M. Savva, and P. Hanrahan. Characterizing structural relationships in scenes using graph kernels. In ACM SIGGRAPH 2011 Papers, 2011.
[11] C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence, location and appearance. In Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on. IEEE, 2008.
[12] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), 2016.
[14] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In European Conference on Computer Vision. Springer, 2008.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[16] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. arXiv preprint arXiv:1511.05298, 2015.
[17] Z. Jia, A. Gallagher, A. Saxena, and T. Chen. 3D-based reasoning with blocks, support, and stability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[18] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[19] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems 24, 2011.
[20] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. In arXiv, 2016.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Graph cut based inference with co-occurrence statistics. In European Conference on Computer Vision. Springer, 2010.
[23] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic object parsing with graph LSTM. In European Conference on Computer Vision, 2016.
[24] W. Liao, M. Y. Yang, H. Ackermann, and B. Rosenhahn. On support relations and semantic scene graphs. arXiv preprint arXiv:1609.05834, 2016.
[25] D. Lin, S. Fidler, and R. Urtasun. Holistic scene understanding for 3d object detection with RGBD cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 1417-1424, 2013.
[26] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, 2016.
[27] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[28] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[29] A. Oliva and A. Torralba. The role of context in object recognition. Trends in Cognitive Sciences, 11(12):520-527, 2007.
[30] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007.
[31] V. Ramanathan, C. Li, J. Deng, W. Han, Z. Li, K. Gu, Y. Song, S. Bengio, C. Rossenberg, and L. Fei-Fei. Learning semantic relationships for better action retrieval in images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[33] M. R. Ronchi and P. Perona. Describing common human visual actions in images. In BMVC, 2015.
[34] M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011.
[35] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] D. Teney, L. Liu, and A. v. d. Hengel. Graph-structured representations for visual question answering. arXiv preprint arXiv:1609.05600, 2016.
[38] A. Torralba. Contextual priming for object detection. International Journal of Computer Vision, 53(2):169-191, 2003.
[39] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[40] M. Yatskar, L. Zettlemoyer, and A. Farhadi. Situation recognition: Visual semantic role labeling for image understanding. 2016.
[41] Y. Zhao and S.-C. Zhu. Scene parsing by integrating function, geometry and appearance models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3119-3126, 2013.
[42] B. Zheng, Y. Zhao, J. Yu, K. Ikeuchi, and S.-C. Zhu. Scene understanding by reasoning stability and safety. International Journal of Computer Vision, 2015.
[43] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.
[44] C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation of sentences. In Proceedings of the IEEE International Conference on Computer Vision, 2013.
