People Watching: Human Actions as a Cue for Single View Geometry. David Fouhey, V. Delaitre, Abhinav Gupta, Alexei Efros, Ivan Laptev, Josef Sivic. International Journal of Computer Vision, 2014, 17 pp. HAL Id: hal-01066257, https://hal.inria.fr/hal-01066257, submitted on 19 Sep 2014. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

People Watching: Human Actions as a Cue for Single View Geometry

David F. Fouhey · Vincent Delaitre · Abhinav Gupta · Alexei A. Efros · Ivan Laptev · Josef Sivic

D.F. Fouhey, A. Gupta, A.A. Efros: Carnegie Mellon University, Robotics Institute, 5000 Forbes Avenue, Pittsburgh PA 15213. E-mail: [email protected]. A.A. Efros is now with the EECS department at UC Berkeley.
V. Delaitre, I. Laptev, J. Sivic: INRIA, WILLOW project, Département d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548, 23 Avenue d'Italie, 75013 Paris, France.

Abstract. We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.

Keywords: Scene understanding · action recognition · 3D reconstruction

1 Introduction

The human body is a powerful and versatile visual communication device. For example, pantomime artists can convey elaborate storylines completely non-verbally and without props, simply with body language. Indeed, body pose, gestures, facial expressions, and eye movements are all known to communicate a wealth of information about a person, including physical and mental state, intentions, and reactions. But more than that, observing a person can inform us about the surrounding environment with which the person interacts.

Consider the two people detections depicted in Figure 1. Can you tell which one of the three scenes these detections came from? Most people can easily see that it is room A. Even though this is only a static image, the actions and poses of the disembodied figures reveal a lot about the geometric structure of the scene. The pose of the left figure reveals a horizontal surface right under its pelvis ending abruptly at its knees. The right figure's pose reveals a ground plane under its feet as well as a likely horizontal surface near the hand location. In both cases we observe a strong physical and functional coupling between people and the 3D geometry of the scene. In this work, we aim to exploit this coupling.

This paper proposes to use human pose as a cue for 3D scene understanding. Given a set of one or more images from a static camera, the idea is to treat each person as an "active sensor," or probe, that interacts with the environment and in so doing carves out the 3D free-space in the scene.
We reason about human poses following J. J. Gibson's notion of affordances [19]: each pose is associated with the local geometry that permits, or affords, it. This way, multiple poses in space and time can jointly discover the underlying 3D structure of the scene.

In practice, of course, implementing this simple and elegant scenario would be problematic. First of all, the underlying assumption that the humans densely explore the entire observed 3D scene is not realistic: in many scenes, humans may not interact with certain regions for months. But more problematic is the need to recover high-quality 3D pose information for all people in an image. While several very promising 2D pose estimation approaches exist [1,32,53], and while it is possible to use anthropometric constraints to lift the poses into 3D [49], the accuracy of these methods is still too low to be used reliably.

Fig. 1 What can human actions tell us about the 3D structure of a scene? Quite a lot, actually. Consider the people depicted on the left. They were detected in a time-lapse sequence in one of rooms A, B, or C. Which room did they come from? See the text for the answer.

As a result, we take a soft, hybrid approach that integrates appearance cues as well. We first employ the single-view indoor reconstruction method of Hedau et al. [27], which produces a number of possible 3D scene hypotheses. We then use existing human detection machinery to generate pose candidates. The crux of our algorithm is in simultaneously considering the appearance of the scene and perceived human actions in a robust way to produce the best 3D scene interpretation given all the available evidence. We evaluate our approach on both time-lapses and still images taken from the Internet, and demonstrate significant performance gains over state-of-the-art appearance-only methods. We additionally demonstrate the viability of using humans as a cue without appearance and provide substantial analysis of how and when observing humans helps us better understand scenes.

1.1 Background

Our goal is to understand images in terms of 3D geometry and space. Traditional approaches in computer vision for 3D understanding have focused on using correspondences and multiple view geometry [26]. While these methods have been successful, they are not applicable when only a single view of the scene is available; accordingly, they are incapable of understanding the great wealth of consumer and historic photographs not captured with multiple views or depth sensors. Reasoning about this sort of data is an enormous challenge for computers and an open research question, since it is a wildly underconstrained problem. Nonetheless, it is trivial for any human being. Since humans can infer scene structure from a single image, single-view reconstruction is thus a necessary step towards vision systems with human-like capabilities. Furthermore, 3D scene estimates from a single image not only provide a richer interpretation of the image but also improve performance on traditional single-image tasks such as object detection [28,30].

Past work on single-image 3D understanding has overcome the underconstrained nature of the problem in a variety of ways. For instance, Kanade demonstrated the recovery of 3D shapes from a single image in [33] with the overarching constraint that "regularities observable in the picture are not by accident, but are some projection of real regularities," along with a variety of regularities such as parallelism and skewed symmetry. Advances in computing power as well as increased availability of data have led to work that learns to produce 3D understandings from data. This has led to methods that take a single image and recover the dominant 3D directions and horizon line of scenes [3], surface orientations [17,29,40], volumetric objects [24,47,52], or depth [34,45]. In many of these, constraints used in the past (e.g., parallelism) are integrated into the models' objective functions.
There has been particular success in the field of indoor single-view 3D understanding. This has been made possible by the fusion of effective learning techniques for single-image 3D prediction such as [29] and low-dimensional (i.e., highly constrained) models for 3D layout specifically tailored to indoor images [5,8,9,27,37,48,51,55]. The constraints involved usually concern the human-made nature of the scene: almost all approaches make the Manhattan-world assumption, in which there are three orthogonal scene directions, and in most cases it is assumed that the room can be roughly modeled as a cuboid. However, although they assume a human-centric scene structure, each of these approaches treats humans as clutter rather than as a cue: for all previous work, the best and arguably correct reaction to a human in the scene is to classify her as clutter and to ignore her pixels in room layout estimation. In fact, despite being fundamentally dependent on the human-centric nature of the scenes, all recent work on room layout prediction has been evaluated on datasets containing exactly zero people. Given that humans and their activities are often the primary motivation for documenting scenes, and that human scenes are constructed for humans, this seems unnatural. This work aims to demonstrate that humans are not a nuisance, but rather another valuable source of constraints.

Fig. 2 Overview of the proposed approach. We propose the use of both appearance and human action cues for estimating single-view geometry. Given an input image or set of input images taken by a fixed camera, we estimate human poses in each image (a), yielding a set of human-scene interactions (b), which we aggregate over time (for time-lapses). We use these to infer functional surfaces (c) in the scene, such as sittable (red) and walkable (blue). We simultaneously generate room hypotheses (d) from appearance cues alone. We then select a final room hypothesis and infer the occupied space in the 3D scene using both appearance and human action cues (e).

Other work on the interface between humans and image understanding has mostly focused on modeling these constraints at a semantic level [11,23,50]. For example, drinking and cups are functionally related, and therefore joint recognition of the two should improve performance. Semantic-level constraints have also been shown to improve object discovery and recognition [18,41,50], action recognition [11,12,23,35], and pose estimation [22,54]. Recently, Delaitre et al. [10] proposed the use of poses for semantic segmentation of scenes; like ours, their work also uses poses as a cue, but it solves the complementary problem of giving each pixel in an image a semantic label (e.g., chair), not improving estimates of 3D scene geometry.

In this paper we specifically focus on modeling relationships at a physical level between humans and 3D scene geometry. In this domain, most earlier work has focused on using geometry to infer human-centric information [20,25], or the question "what can a human do with a given 3D model?" For instance, Gupta et al. [25] argued that functional questions such as "Where can I sit?" are more important than categorizing objects by name, and used estimated 3D geometry in images to infer Gibsonian affordances [19], or "opportunities for interaction" with the environment. Jiang et al. [31] used the sizes and poses of humans to infer human-object affordances from 3D scenes containing no humans.

Our work focuses on the inverse of the problem addressed in [20,25]: we want to observe human actors, infer their poses, and then use the functional constraints from these poses to improve 3D scene understanding. Our goal is to harness the recent advances in person detection and pose estimation [1,4,14,32,53], and design a method to improve single-view indoor geometry estimation. Even though the building blocks of this work, human pose estimation and 3D image understanding, are by no means perfect, we show that they can be robustly combined. We also emphasize our choice of the monocular case, which sets our work apart from earlier work on geometric reasoning using human silhouettes [21] in multi-view setups. In single-view scenarios, the focus has been on coarse constraints from person tracks [36,44,46], whereas we focus on fine-grained physical and functional constraints using human actions and poses.
A preliminary version of this work appeared as [16]. We clarify many technical details omitted in the previous version, present results on substantially extended datasets, and offer an in-depth analysis of how, why, and when observing humans can improve understanding of the 3D geometry of scenes.

Fig. 3 Example action detection and pose estimation results with the articulated model. The predicted surface contact points are shown by ellipses: blue (walkable), red (sittable), green (reachable). Shown actions are: standing (1-2), sitting (3-5), and reaching (6-8).

2 Overview

Our work is an attempt to marry human action recognition with 3D scene understanding. We have made a number of simplifying assumptions. We limit our focus to indoor scenes: they allow for interesting human-scene interactions, and several successful approaches exist specifically for estimating their geometry from a single view [27,37]. We use a set of commonly observed physical actions (reaching, sitting, and walking) to provide constraints on the free and occupied 3D space in the scene.
To achieve this, we manually define the surface constraints provided by each action; e.g., there should be a sittable horizontal surface at knee height for the sitting action. We adopt a geometric representation that is consistent with recent methods for indoor scene layout estimation. Specifically, each scene is modeled in terms of the layout of the room (walls, floor, and ceiling) and the 3D layout of the objects. It is assumed that there are three principal directions in the 3D scene (the Manhattan-world assumption [6]), and therefore estimating a room involves fitting a parametric 3D box that is aligned with the vanishing points.

While temporal information can be useful for detecting human actions and imposing functional and geometrical constraints, in this work we only deal with still images and time-lapse videos with no temporal continuity. Time-lapses are image sequences recorded at a low frame rate, e.g., one frame a second. Such sequences are often shot with a static camera and show a variety of interactions with the scene while keeping the static scene elements fixed. People use time-lapses to record and share summaries of events such as home parties or family gatherings. Videos can nonetheless be used in the proposed framework without any modifications by ignoring the temporal information and treating them as a time-lapse, or by substituting our single-frame pose estimators with approaches that integrate temporal information [2,39,42].

Time-lapse data is ideal for our experiments for many reasons. In our case, the time discontinuity works in our favor, as it naturally compresses highly diverse person-scene interactions into a small number of frames: a time-lapse video lasting a few minutes may show many hours of events. Moreover, the static nature of the underlying scene enables joint reasoning about multiple images without solving for camera pose to find a common reference frame. This lets us focus on the core of our problem, rather than a structure-from-motion preprocessing step. Finally, the time-lapses also enable us to test our method on realistic data with non-staged activities in a variety of natural environments: time-lapses are captured by consumers in their daily living environments, frequently with viewing angles similar to consumer still photographs. Our datasets, for instance, contain time-lapses gathered from a consumer video sharing site, YouTube.com.

An overview of our approach is shown in Figure 2. First, we detect humans performing different actions in the image and use the inferred body poses to extract functional regions in the image, such as sittable and reachable surfaces (Section 3). For time-lapses, we accumulate these detections over time for increased robustness. We then use these functional surface estimates to derive geometrical constraints on the scene. These constraints are combined with an existing indoor scene understanding method [27] to predict the global 3D geometry of the room by selecting the best hypothesis from a set of hypotheses (Section 4.1). Once we have the global 3D geometry, we can use these human poses to reason about the free-space of the scene (Section 4.2).
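For illustration only, here is a minimal Python sketch of the pipeline in Figure 2. Every stage function is passed in as a parameter because these are hypothetical stand-ins for the components described in Sections 3 and 4, not code released with the paper.

```python
def understand_scene(frames, detect_actions, accumulate_maps,
                     gen_hypotheses, score, carve_free_space):
    # (a, b) Detect people performing each action in every frame and
    # pool the detections over time (for time-lapses).
    detections = [d for frame in frames for d in detect_actions(frame)]

    # (c) Aggregate contact points into functional-surface probability
    # maps (walkable, sittable, reachable); see Section 3.
    h_maps = accumulate_maps(detections)

    # (d) Re-rank appearance-based room hypotheses with the human cues
    # (Section 4.1, Equation 1) and keep the top-scoring layout.
    best_room = max(gen_hypotheses(frames[0]),
                    key=lambda y: score(frames[0], h_maps, y))

    # (e) Carve the free and occupied space inside the chosen room box
    # using the functional maps (Section 4.2).
    return best_room, carve_free_space(best_room, h_maps)
```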
3 Local Scene Constraints from People's Actions

Our goal is to predict functional image regions corresponding to walkable, sittable, and reachable surfaces by analyzing human actions in the scene. We achieve this by detecting and localizing people performing the three different actions (standing, sitting, reaching) and then using their pose to predict contact points with the surfaces in the scene. For time-lapses, contact points are aggregated over multiple frames to provide improved evidence for the functional image regions. We illustrate these contact points on detected humans in Fig. 3.

Given a person detected performing an action, we predict contacts with surfaces as follows: (i) for walkable surfaces, we define a contact point as the mean location of the feet, and use all three types of actions; (ii) for sittable surfaces, we define a contact point at the mean location of the hip joints, and consider only sitting actions; and (iii) for reachable surfaces, we define a contact point as the location of the hand farther from the torso, and use only reaching actions. These surfaces are not mutually exclusive (e.g., the tops of beds are sittable and reachable) and are estimated independently. To increase robustness to mistakes in localization, we place a Gaussian at the contact points of each detection and weight the contribution of the pose by the classifier confidence; one can equivalently view this as each contact point voting for the properties of the scene, with higher weight placed on nearby locations and on detections with high confidence. The standard deviation of each Gaussian is set to a fraction of the detection bounding box, 1/4 in the X-direction and 1/40 in the Y-direction; we use the bounding box dimensions to automatically scale our region of uncertainty with human proportions. This yields probability maps h for the different types of functional image regions, as illustrated in Figures 2c and 5c,d.

Fig. 4 Example detections with the deformable part models (top row: sitting; bottom row: standing) and approximated joint locations from bounding boxes (red: pelvic joint/sittable; blue: feet/walkable).

Since a sitting detector may also respond to standing people, we discriminate between different actions by converting the detection score of each model into a probability by fitting a decreasing exponential law to its firing rate. Action classification is performed with non-maximum suppression: if the bounding boxes of several detections overlap, irrespective of action class, the detection with the highest calibrated response is kept.
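As a concrete illustration of the contact-point voting described above, the following NumPy sketch splats anisotropic Gaussians (1/4 of the box width in X, 1/40 of the box height in Y, per the text) weighted by the calibrated detector confidence. The Detection record and its joint fields are assumptions made for this example, not the authors' data structures.

```python
from collections import namedtuple
import numpy as np

# Hypothetical detection record: action label, joint coordinates,
# bounding-box width/height, and calibrated confidence.
Detection = namedtuple("Detection", "action joints w h conf")

def add_contact_vote(h_map, cx, cy, box_w, box_h, confidence):
    # Splat one contact point into a functional-region map as a 2D
    # Gaussian with sigma_x = box_w / 4 and sigma_y = box_h / 40,
    # scaled by the detector's calibrated confidence.
    H, W = h_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sx, sy = box_w / 4.0, box_h / 40.0
    g = np.exp(-((xs - cx) ** 2 / (2 * sx ** 2) +
                 (ys - cy) ** 2 / (2 * sy ** 2)))
    h_map += confidence * g

def functional_maps(detections, shape):
    # One vote map per functional surface type; the surfaces are not
    # mutually exclusive and are accumulated independently.
    maps = {k: np.zeros(shape) for k in ("walkable", "sittable", "reachable")}
    for d in detections:
        # (i) Walkable: mean feet location, from all three actions.
        fx, fy = np.mean(d.joints["feet"], axis=0)
        add_contact_vote(maps["walkable"], fx, fy, d.w, d.h, d.conf)
        # (ii) Sittable: mean hip location, sitting detections only.
        if d.action == "sitting":
            px, py = np.mean(d.joints["hips"], axis=0)
            add_contact_vote(maps["sittable"], px, py, d.w, d.h, d.conf)
        # (iii) Reachable: hand farther from the torso, reaching only.
        if d.action == "reaching":
            hx, hy = d.joints["far_hand"]
            add_contact_vote(maps["reachable"], hx, hy, d.w, d.h, d.conf)
    return maps
```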
Our approach is agnostic to the particular method of pose detection and only requires a detector that produces a class (e.g., sitting) as well as estimates for the relevant joints of the human (e.g., pelvic joint, feet). In this work, we use two complementary approaches. We build primarily on the articulated pose model of Yang and Ramanan [53]. Here, we employ the model for detecting human actions by training a separate model for each of the three actions. To supplement the articulated model, we use the deformable parts model (DPM) of Felzenszwalb et al. [14] for sitting and standing: the low variance of the relevant joints for these actions (e.g., feet for standing) enables us to accurately approximate poses by simply transferring a fixed pose with respect to the bounding box. We find that these two methods have complementary strengths and error modes. Examples of detected actions together with estimated body pose configurations and predicted contact points are shown in Figure 3 for the articulated model and in Figure 4 for the deformable parts model.

The articulated pose estimator and deformable parts model are trained and used separately, and produce independent estimates of functional regions. These functional regions are integrated separately in our room layout ranking function in Equation 2.

4 Space Carving Via Humans

In the previous section we discussed how we estimate human poses and functional regions such as sittable and walkable surfaces. Using the inferred human poses, we now ask: "What 3D scene geometry is consistent with these human poses and functional regions?" We build upon [25], and propose three constraints that human poses impose on 3D scene geometry:

Containment. The volume occupied by a human should be inside the room.

Free space. The volume occupied by a human cannot intersect any objects in the room. For example, for a standing pose, this constraint would mean that no voxels below 5 ft can be occupied at standing locations.

Support. There must be object surfaces in the scene which provide sufficient support so that the pose is physically stable. For example, for a sitting pose, there must exist a horizontal surface beneath the pelvis (such as a chair). This constraint can also be written in terms of the functional regions; for example, the backprojected sittable regions must be supported by occupied voxels in the scene.

Our goal is to use these constraints from observed human poses to estimate the room geometry and the occupied voxels in the scene. Estimating the voxels occupied by the objects in the scene depends on the global 3D room layout as well as the free-space and support constraints. On the other hand, estimating the 3D room layout depends only on the containment constraint and is independent of the free-space and support constraints. Therefore, we use a two-step process: in the first step, we estimate the global 3D room layout, represented by a 3D box, using appearance cues and the containment constraints from human actors. This is done by ranking a large collection of room hypotheses and selecting the top-ranked hypothesis. In the second step, we use the estimated box layout to estimate the occupied voxels in the scene. Here, we combine cues from scene appearance and human actors to carve out the 3D space of the scene.

4.1 Estimating Room Layout

Given an image and the set of observed human poses, we want to infer the global 3D geometry of the room. We build on the approach of Hedau et al. [27], which ranks a collection of vanishing-point-aligned room hypotheses according to their agreement with appearance features. However, estimating the room layout from a single view is a difficult problem, and it is often almost impossible to select the right layout using appearance cues alone: frequently, there are a handful of top hypotheses with inadequate evidence to decide which is correct. We propose to further constrain the inference problem by using the containment constraint from human poses. This is achieved with a scoring function that uses appearance terms as in [27] as well as terms that evaluate to what degree the hypothesized room layout is coherent with the observed human actors.

Fig. 5 Predicting functional image regions. (a) An image from a time-lapse sequence. (b) Overlaid example person detections from different frames: standing (blue), reaching (green). (c,d) Probability maps of predicted locations for (c) walkable and (d) reachable surfaces. Note that the two functional surfaces overlap on the floor.

Given input image features x and the observed human actors H (represented by functional surface probability maps h ∈ H), our goal is to find the best room layout hypothesis y* ∈ Y. We use the following scoring function to evaluate the coherence of the image features and human poses with the hypothesized room layout y:

f(x, H, y) = α_ψ ψ(x, y) + α_φ φ(H, y) + α_ρ ρ(y),    (1)

where ψ(x, y) measures the compatibility of the room layout configuration y with the estimated surface geometry computed using image appearance, φ(H, y) measures the compatibility of the human poses with the room layout, and ρ(y) is a regularizing penalty term on the relative floor area that encourages smaller rooms; the αs trade off between the compatibility terms.

As we build upon the code of Hedau et al., the first term, ψ(x, y), is the scoring function learned via Eqns. 3-4 of [27]. The function uses global appearance cues, such as detected straight lines and the per-pixel classifier predictions for different surface labels (walls, floor, ceiling). The second term enforces the containment constraints and expands as

φ(H, y) = Σ_{h ∈ H} α_{φ,h} ϕ(ζ(h), y),    (2)

where ζ(h) is the mapping of functional surfaces onto the ground plane and ϕ measures the normalized overlap between the projection and the floor in the hypothesized room layout. Intuitively, φ(H, y) enforces that the projection of both the human body and the objects it is interacting with should lie inside the room.
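A minimal sketch of the scoring in Equations 1 and 2 may make the ranking concrete. The callables psi, zeta, overlap, and floor_area are stand-ins for the appearance scorer of [27] and the geometric mappings defined above; the per-surface weights α_{φ,h} and the penalty ρ(y) are discussed in the following paragraphs.

```python
def room_score(x, H, y, alphas, M, psi, zeta, overlap, floor_area):
    # Equation 2: per-surface-type weighted, normalized overlap
    # between the ground-plane projection zeta(h) of each functional
    # map h and the floor of hypothesis y.
    phi = sum(alphas["phi_h"][h.kind] * overlap(zeta(h), y) for h in H)
    # Floor-area penalty rho(y) = -max(0, (A - M) / M), where M is the
    # minimum floor area among the top three hypotheses.
    rho = -max(0.0, (floor_area(y) - M) / M)
    # Equation 1: appearance term + human containment term + penalty.
    return (alphas["psi"] * psi(x, y)
            + alphas["phi"] * phi
            + alphas["rho"] * rho)
```

The predicted layout is then the hypothesis y in the pool Y maximizing this score, e.g. `max(Y, key=lambda y: room_score(x, H, y, ...))`.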
The α φ,h cues alone: frequently, there are a handful of top hy- terms trade off the weightings of the functional sur- potheses with inadequate evidence to decide which is faces. In the current system, we do not enforce that correct. We propose to further constrain the inference thebodydoesnotintersecttheceiling,althoughthisis problem by using the containment constraint from hu- not an issue in practice. We approximate ζ(h) by using manposes.Thisisachievedwithascoringfunctionthat the feet locations of detected actors, which produces uses appearance terms as in [27] as well as terms to 1 accurate results for our action vocabulary. Finally, the evaluate to what degree the hypothesized room layout term ρ(y) = −max(0,(A−M)/M) imposes a penalty is coherent with observed human actors. forexcessivefloorareaA,measuredwithrespecttothe People Watching 7 Clutter Backprojected Functional Regions Free Space Walkable Sittable Reachable Fig. 6 We estimate the free space of the scene by taking the backprojected clutter map and refining it with backprojected functionalregions.Voxelsaboveallbackprojectedfunctionalregionsrecievevotesagainstbeingoccupiedduetothefreespace constraint; voxels below reachable and sittable regions (i.e., within the volumes) receive votes in favor of being occupied due tothesupportconstraint. Theresultofcombiningallcuesisshownontheright. minimum floor area M out of the top three hypothe- al.[27],andthenincorporateconstraintsfromdifferent ses. We include this regularization term since φ(H,y) human poses to further refine this occupied voxel map. can only expand the room to satisfy the containment Specifically, we backproject each functional region h at constraint. its3Dheight,yieldingahorizontalsliceinsidethevoxel Therelativeweightsαonthetermsarelearnedina map. Because our classes are fine-grained, we can use leave-one-out fashion. One term, α , is held constant, human dimensions (waist height) for the heights of the ψ and grid search is performed on a fairly coarse grid (2i sitting and reaching surfaces. This slice is then used fori={−4,...,2}fortheotherαs.Fortheregulariza- to cast votes above and below in voxel-space: votes in tionterm,weincludetheadditionaloptionofsettingα favor of occupancy are cast in the voxels below; votes ρ to0.Wechosetheweightingthatresultsinthehighest against occupancy are cast in the voxels above. This is meanperformancewhenchoosingthetop-rankedroom illustrated in Fig. 6. The final score for occupancy of a onthetrainingset.However,wenotethatthesystemis particularvoxelisalinearsumofthesevotes;asthere- fairly insensitive to the particular αs: on no parameter sult is probabilistic, to produce a binary interpretation setting does the system produce worse results than the as shown in the figures, we must threshold the results. appearance-only system on any dataset. Weselectourroominterpretationfromthesamehy- pothesis pool Y as [27]. First, a modified version of the 5 Experiments vanishing point detector of Rother [43] estimates three orthogonal vanishing points in the scene. The discrete In this section, we describe experiments done to vali- poolofhypothesesisgeneratedbydiscretizingthespace date our contributions. We introduce the experimental of all vanishing-point aligned boxes. We rank each hy- setup,thedatasets,andrawquantitativeresultsaswell pothesis with Equation 1 and return the top-scoring as some qualitative results. A detailed analysis of how hypothesis as our predicted layout. observingpeoplechangesroomlayoutestimationispre- sented in Section 6. 
4.2 Estimating Free Space in the Scene

Once we have estimated the room layout, we now estimate the voxels occupied by objects. However, this is a difficult and ill-posed problem. Hedau et al. [27] use an appearance-based classifier to estimate the pixels corresponding to objects in the scene. These pixels are then back-projected under the constraint that every occupied voxel must be supported. Lee et al. [37] and Gupta et al. [25] further constrain the problem with domain-specific cuboid object models and constraints such as "attachment to walls." We impose functional constraints: a human actor carves out the free space and support surfaces by interacting with the scene.

The room layout and camera calibration give a cuboidal 3D voxel map in which we estimate the free space. We first backproject the clutter mask of Hedau et al. [27], and then incorporate constraints from different human poses to further refine this occupied voxel map. Specifically, we backproject each functional region h at its 3D height, yielding a horizontal slice inside the voxel map. Because our classes are fine-grained, we can use human dimensions (waist height) for the heights of the sitting and reaching surfaces. This slice is then used to cast votes above and below in voxel space: votes in favor of occupancy are cast in the voxels below; votes against occupancy are cast in the voxels above. This is illustrated in Fig. 6. The final score for the occupancy of a particular voxel is a linear sum of these votes; as the result is probabilistic, to produce a binary interpretation as shown in the figures, we must threshold the results.

Fig. 6 We estimate the free space of the scene by taking the backprojected clutter map and refining it with backprojected functional regions. Voxels above all backprojected functional regions receive votes against being occupied due to the free-space constraint; voxels below reachable and sittable regions (i.e., within the volumes) receive votes in favor of being occupied due to the support constraint. The result of combining all cues is shown on the right.
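The voxel voting might look like the following sketch, assuming an (X, Y, Z) voxel grid with Z pointing up; backproject, which maps an image-space functional map to ground-plane weights at a given height, is a stand-in for the camera-geometry step.

```python
import numpy as np

def carve_free_space(clutter_votes, functional_maps, slice_heights,
                     backproject, threshold=0.0):
    # Start from the backprojected clutter map of [27], held here as
    # precomputed per-voxel votes on an (X, Y, Z) grid.
    votes = clutter_votes.copy()
    for kind, h_map in functional_maps.items():
        z = slice_heights[kind]          # e.g., waist height for sittable
        slab = backproject(h_map, z)     # (X, Y) weights at that height
        votes[:, :, :z] += slab[:, :, None]   # support: occupied below
        votes[:, :, z:] -= slab[:, :, None]   # free space: empty above
    # The occupancy score is the linear sum of votes; threshold it to
    # get the binary occupied/free interpretation shown in the figures.
    return votes > threshold
```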
5 Experiments

In this section, we describe the experiments done to validate our contributions. We introduce the experimental setup, the datasets, and raw quantitative results as well as some qualitative results. A detailed analysis of how observing people changes room layout estimation is presented in Section 6.

As a preface, we note that the primary contribution of this work is the demonstration that humans can serve as a valuable cue for single-view geometry problems in practice. Although we present a way of integrating functional surfaces into single-view reasoning with a particular appearance-based system operating in a particular paradigm, there are many other ways of recovering the layout, surfaces, or 3D structure of a room from a single view. The purpose of this work is not to demonstrate that our particular system outperforms all other approaches; instead, it is to demonstrate that affordance-based cues offer complementary evidence to appearance-based ones. Accordingly, we design our experiments and analysis to investigate how our system performs relative to a system using the same appearance features and set of hypotheses.

Fig. 7 Example time-lapse sequence results: given an input image, we use functional regions (walkable: blue; sittable: red; reachable: green) to constrain the room layout; having selected a layout, we can also infer a more fine-grained geometry of the room via functional reasoning.

Fig. 8 Time-lapse experiment: a comparison of (a) the appearance-only baseline [27] with (b) our improved room layout estimates. In many cases, the baseline system selects small rooms due to large amounts of clutter. On the right, even though the room is not precisely a cuboid, our approach is able to produce a better interpretation of the scene.

Baselines and evaluation criteria. For both time-lapses and single images, we compare our estimates of room geometry to a number of baselines. Our primary baseline is the appearance-only system of Hedau et al. [27], which we use to provide appearance cues. To provide context, we also include another baseline, in which we impose the box model on the output of Lee et al. [38] that maximizes the agreement. Finally, to show that all methods are operating better than chance, we use location alone to predict the pixel labels, akin to a per-pixel prior: after resizing all scenes to common dimensions, we use the majority label in the training images for each pixel.

To quantitatively evaluate how well we estimate the layout of an image, we use the standard metric of per-pixel accuracy (i.e., treating the problem as a semantic segmentation one). Note that since the camera is fixed in time-lapses, the scene can be summarized with a single image (the one provided to the appearance-only approaches) and thus only a single annotation is needed. In some time-lapses, the camera is adjusted or zoomed slightly; this only impacts our approach, and not the appearance-only approaches. When aggregating a statistic over a dataset, we quantify our uncertainty regarding the statistic via bootstrapped confidence intervals, which we compute with the Bias-Corrected Accelerated method [13], using 10,000 replicates.
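For concreteness, here is one way the evaluation protocol could be implemented: per-pixel layout accuracy plus a textbook Bias-Corrected Accelerated (BCa) bootstrap interval with 10,000 replicates, matching the description above. This is a generic implementation of the cited method [13], not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def pixel_accuracy(predicted, ground_truth):
    # Per-pixel layout accuracy, treating layout estimation as a
    # semantic segmentation problem over the surface labels.
    return float(np.mean(predicted == ground_truth))

def bca_interval(values, stat=np.mean, n_boot=10_000, alpha=0.05, seed=0):
    # Bias-Corrected and Accelerated bootstrap confidence interval.
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n, theta = len(values), stat(values)
    boots = np.array([stat(rng.choice(values, n, replace=True))
                      for _ in range(n_boot)])
    # Bias correction: fraction of replicates below the point estimate.
    z0 = norm.ppf(np.clip(np.mean(boots < theta), 1e-9, 1 - 1e-9))
    # Acceleration: skewness of the jackknife (leave-one-out) estimates.
    jack = np.array([stat(np.delete(values, i)) for i in range(n)])
    d = jack.mean() - jack
    a = (d ** 3).sum() / (6.0 * (d ** 2).sum() ** 1.5)
    # Percentiles of the bootstrap distribution at adjusted levels.
    z = norm.ppf([alpha / 2, 1 - alpha / 2])
    adj = norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    lo, hi = np.quantile(boots, adj)
    return lo, hi
```

In this setting, pixel_accuracy would be computed per scene, and bca_interval would then be applied to the list of per-scene accuracies.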
Implementation details. We train detectors using the Yang and Ramanan model for all three actions [53] and the Felzenszwalb et al. model for sitting and standing. For the standing action, we use a subset of 196 images from [53] containing standing people. For sitting and reaching, we collect and annotate 113 and 77 new images, respectively. All images are also flipped, doubling the training data size. As negative training data we use the INRIA outdoor scenes [7], indoor scenes from [27], and a subset of the Pascal 2008 classification training data. None of the three negative image sets contains people. On testing sequences, adaptive background subtraction is used to find foreground regions in each frame, which are used as the appearance for the time-lapse and to remove false-positive detections on the background. We also use geometric filtering similar to [30] to remove detections that significantly violate the assumption of a single ground plane.

5.1 Datasets

We test the proposed approach on consumer time-lapse videos and a collection of indoor still images. The data for both originates from the Internet and depicts challenging, cluttered, and non-staged scenes capturing one or more people engaged in everyday activities interacting with the scene. Comparison on existing datasets is impossible, since previous work on room layout estimation has ignored people as a cue and has been evaluated on datasets composed entirely of unoccupied rooms.

Time-lapse data. For time-lapse videos, we present results on the dataset introduced in [16], as well as on the larger semantic segmentation dataset presented in [10], which contains the dataset of [16] as a subset. We refer to the dataset of [16] as the People Watching time-lapses, and that of [10] as the Scene Semantics time-lapses. We have re-labeled the Scene Semantics dataset for room layout prediction.

Both datasets were collected from YouTube using keywords such as "time-lapse," "living room," "party," or "cleaning." The People Watching dataset contains 40 videos with about 140,000 frames. The Scene Semantics dataset contains 146 videos, totaling about 400,000 frames.

Unlike the People Watching dataset, which contains largely unambiguous and cuboidal bedrooms and living rooms, the Scene Semantics dataset contains a wider variety of scene classes as well as many scenes that violate the assumptions of our method. These include straightforward violations of explicit assumptions, such as that no more than three walls are visible. They also include violations of more subtle implicit assumptions, such as that the full body of a person will be within the frame and that the floor will occupy some reasonable fraction of the scene: 10% of scenes in the Scene Semantics dataset have floor coverage of 15% or less, compared to none in the People Watching dataset. This leads to truncated detections at the bottom of the image that do not actually rest on the floor. In the handful of cases where the cuboidal model is explicitly violated, only the unambiguous parts are annotated: for instance, if four walls are visible due to a fish-eye lens, only the floor and ceiling are annotated.

Still image data. Our previous work [16] introduced a dataset of 100 still images of indoor scenes with people. We expand this dataset to 500 images. These images were retrieved from the Internet with queries such as "living room" and "waiting room," with the criteria that they are roughly cuboidal and that they contain at least one person. These non-staged images depict celebrities, political figures, and ordinary people engaged in everyday tasks, ranging from simply sitting and talking to having meetings or parties.
