
Modeling Grasp Motor Imagery through Deep Conditional Generative Models

Matthew Veres, Medhat Moussa, and Graham W. Taylor

Abstract—Grasping is a complex process involving knowledge of the object, the surroundings, and of oneself. While humans are able to integrate and process all of the sensory information required for performing this task, equipping machines with this capability is an extremely challenging endeavor. In this paper, we investigate how deep learning techniques can allow us to translate high-level concepts such as motor imagery to the problem of robotic grasp synthesis. We explore a paradigm based on generative models for learning integrated object-action representations, and demonstrate its capacity for capturing and generating multimodal, multi-finger grasp configurations on a simulated grasping dataset.

Index Terms—Grasping, Visual Learning, Multifingered Hands, Deep Learning, Generative Models

Manuscript received September 10, 2016; revised November 30, 2016; accepted December 27, 2016. This paper was recommended for publication by Editor Han Ding upon evaluation of the Associate Editor and Reviewers' comments. This work is supported by the Natural Sciences and Engineering Research Council of Canada, and the Canada Foundation for Innovation. The authors are with the School of Engineering, University of Guelph, 50 Stone Road East, Guelph, Ontario, Canada. {mveres,mmoussa,gwtaylor}@uoguelph.ca

I. INTRODUCTION

Humans have an innate ability for performing complex grasping maneuvers. Often, these maneuvers are performed unconsciously, where object dynamics are unknown or continuously changing through time. This ability also manifests where objects themselves may be either similar or novel to those previously encountered. Given some prior experience on grasping an object, it seems highly unlikely that humans learn from scratch how to grasp each new object that is presented to them. Rather, we posit that this ability is driven through both motor and object representations, allowing for abstract generalizations and efficient transference of skills among objects.

In robotics, grasping and manipulation is a critical and challenging problem. Its difficulty stems from variability in an object's shape and physical properties, the gripper capabilities, and task requirements. As such, most industrial applications require robots to use myriad gripping fixtures or end-of-arm tools to grasp various objects. But as robots expand to applications in non-structured environments (both in industry and beyond), advanced grasping skills are needed.

Currently there are several difficulties in actually learning how to grasp. First, the problem is fundamentally a many-to-many mapping: an object can be grasped in many ways that are all equivalent, while the same grasp configuration can be used to grasp many objects. There is a need to maintain this many-to-many mapping to enable the robot to grasp objects under uncertainty and in highly cluttered environments. Second, while the object shape and location can be obtained from perception, grasping is often challenged by inherent characteristics of the object such as surface friction, weight, center of mass, and finally the actual functionality of the object. All of these factors are only known after the object is touched and the grasp is started.

In this paper, we propose to learn a new concept that we refer to as the grasp motor image (GMI). The GMI combines object perception and a learned prior over grasp configurations to synthesize new grasps to apply to a different object. We liken this approach to the use of motor representations within humans. Specifically, we focus on the use of motor imagery for creating internal representations of an action, which requires some knowledge or intuition of how an object may react in a given scenario.

We show that using recent advances in deep learning (DL), specifically deep conditional generative models [27] and the Stochastic Gradient Variational Bayes (SGVB) framework [7], we can capture multimodal distributions over the space of grasps conditioned on visual perception, synthesizing grasp configurations with minimal additional complexity compared to conventional techniques such as convolutional neural networks (CNNs). We quantitatively compare our approach to the discriminative CNN baseline and other generative models, and qualitatively inspect samples generated from the learned distribution.
A. Contributions

Most work within deep learning and robotic grasp synthesis has focused in one form or another on the prediction of grasp configurations given visual information. The goal of this work is to show how, given an idea of the properties characterizing an object and an idea of how a similar object was grasped previously, a unified space can be constructed that allows grasps to be generated for novel objects.

A second contribution of this work is a probabilistic framework, leveraging deep architectures to learn multimodal grasping distributions for multi-fingered robotic hands. Grasping is inherently a many-to-many mapping, yet as we show in this paper, naïvely applying mainstream deep learning approaches (e.g. convolutional neural networks) may fail to capture these complex distributions, or to learn in these scenarios without some stochastic components. Here, we demonstrate the feasibility of deep generative models for capturing multimodal distributions conditional on visual input, and open the door to future work such as the integration of other sensory modalities.

II. BACKGROUND AND RELATED WORK

Within robotics, our work shares some similarities with the notion of experience databases (EDs), where prior knowledge is used to either synthesize grasps partially (i.e. through constraints or algorithmic priors [14]) or fully, by using previously executed grasps. Our work is also similar in spirit to the concept of Grasp Moduli Spaces [21], which define a continuous space for grasp and shape configurations.

Yet, instead of classical approaches with EDs (which require manual specification of storage and retrieval methods), our approach allows these methods to be learned automatically, for highly abstract representations. Further, our approach to constructing this "grasp space" is to collect realistic data on object and grasp attributes using sensory modalities and hand configurations, and learn the encoding space as an integrated object-action representation.
A. Motor imagery

Motor imagery (MI) and motor execution (ME) are two different forms of motor representation. While motor execution is an external representation (physical performance of an action), motor imagery is the use of an internal representation for mentally simulating an action [5], [17].

As MI is an internal process, assessing MI performance is typically done by analyzing behavioural patterns (such as response time for a given task), or by visualizing neural activity using techniques such as functional magnetic resonance imaging (fMRI). Using these techniques, many studies have reported overlapping activations in neural regions between MI and ME (e.g. [4], [26]; review: [18]), suggesting that at some level, participating in either MI or ME engages some amount of shared representation. These findings lend credence to the hypothesis that, for example, mentally simulating a grasping task shares some measure of representation with actually performing the grasp itself.

Among many studies that have examined this phenomenon, we highlight one by Frak et al. [3], who explored it in the context of which frame of reference is adopted during implicit (unconscious) MI performance. The authors presented evidence that even though MI is an internal process, participants mentally simulating a grasp on a water container did so under real-world biomechanical constraints. That is, grasps or actions that would have been uncomfortable to perform in the real world (e.g. due to awkward joint positioning) were correlated with responses of the mentally simulated action.

B. Learning joint embeddings and object-action interactions

Learning joint embeddings of sensory and motor data is not new. It has been proposed, e.g., by Uno et al. [30], who attempted to learn a joint embedding between visual and motor information. In the field of robotics, other works that have used multimodal embeddings include Sergeant et al. [25] for control of a mobile robot, Noda et al. [19] for behaviour modeling in a humanoid robot, and recently Sung, Lenz, and Saxena [29], who focus on transferring trajectories.

Congruent to learning joint embeddings, there has also been work in robotic grasping on learning how objects and actions interact with each other, and what effects they cause. Bayesian networks have been used to explore these effects by Montesano et al. [16], who use discretized high-level features and learn the structure of the network as part of a more generalized architecture. Other work with Bayesian networks includes Song et al. [28], who explore how large input spaces can be discretized efficiently for use in grasping tasks.

Our approach differs significantly from these works in the scale of data, application, and network structures. Further, with respect to Bayesian networks, we work with continuous data without any need for discretization.

C. Deep learning and robotic grasping

The majority of work in DL and robotic grasping has focused on the use of parallel-plate effectors with few degrees of freedom. Both Lenz [9] and Pinto [20] formulate grasping as a detection problem, and train classifiers to predict the most likely grasps through supervised learning. By posing grasping as a detection problem, different areas of the image could correspond to many different grasps, fitting the multimodality of grasping; yet, to obtain multiple grasps for the same image patch, some form of stochastic component or a priori knowledge of the types of grasps is required.

Mahler et al. [14] approach the problem of grasping through the use of deep, multi-view CNNs to index prior knowledge of grasping an object from an experience database. Levine et al. [11] work towards the full motion of grasping by linking the prediction of motor commands for moving a robotic arm with the probability that a grasp at a given pose will succeed. Other work on full-motion robotic grasping includes Levine [10] and Finn [2], who learn visuomotor skills using deep reinforcement learning.

DL and robotic grasping have also recently extended to the domain of multi-fingered hands. Kappler et al. [6] used DL to train a classifier to predict grasp stability of the Barrett hand under different quality metrics. In this work, rather than treating grasping as a classification problem, we propose to predict where to grasp an object with a multi-fingered hand, through a gripper-agnostic representation of available contact positions and contact normals.
tallysimulatingagrasponawatercontainerdidsounderreal- worldbiomechanicalconstraints.Thatis,graspsoractionsthat III. GRASPMOTORIMAGERY would have been uncomfortable to perform in the real world Although the field is far from a consensus on the best uses (e.g. due to awkward joint positioning) were correlated with andstructuresforDLandroboticgrasping,mostworkappears responses of the mentally simulated action. to use deep architectures for processing visual information. At the core of our approach is the autoencoder structure. We B. Learning joint embeddings and object-action interactions briefly review the principles of this method before reviewing Learning joint embeddings of sensory and motor data is the probabilistic models considered herein. not new. It has been proposed, e.g., by Uno et al. [30], who attemptedtolearnajointembeddingbetweenvisualandmotor A. Autoencoders information.Inthefieldofrobotics,otherworksthathaveused multimodalembeddingsincludeSergeantetal.[25]forcontrol An autoencoder (AE) is an unsupervised deep learning of a mobile robot, Noda et al. [19] for behaviour modeling in algorithmthatattemptstodissociatelatentfactorsindatausing a humanoid robot, and recently Sung, Lenz, and Saxena [29] a combination of encoding and decoding networks (Figure who focus on transferring trajectories. 1). In the encoding stage, an input x is mapped through Congruent to learning joint embeddings, there has also a series of (typically) constricting nonlinear hidden layers been work in robotic grasping on learning how objects and to some low-dimensional latent representation of the input VERESetal.:MODELINGGRASPMOTORIMAGERYTHROUGHDEEPCONDITIONALGENERATIVEMODELS 3 f(x). In autoencoders that have an odd number of layers, this The objective of the VAE shares some similarities to the layer is often denoted by z. The decoding phase forces the classical AE; that is, in the case of continuous data the opti- learnedrepresentationtobemeaningfulbymappingittosome mization of a squared error objective along with an additional reconstructiong(f(x))oftheoriginalinput.Trainingproceeds regularizing term. This objective is denoted as the variational by iteratively encoding and decoding a datapoint, measuring lower bound on the marginal likelihood: the difference between the reconstruction and original input, logp (x)≥−D (q (z|x)||p (z)) (1) and backpropagating the error to update the weights of both θ KL φ θ (cid:104) (cid:105) the encoder and decoder. +E logp (x|z) qφ(z|x) θ x g z h x where DKL is the KL-Divergence between the encoding and prior distribution, analagous to a regularization term in a standardautoencoder.Theformofthep (x|z) willdependon θ W1 W2 W3 W4 the nature of the data, but typically is Bernoulli or Gaussian. Note that the VAE formulation only permits continuous latent variables. Commonly, both the prior p (z) and encoding distributions θ are chosen to be multivariate Gaussians with diagonal co- Fig. 1: Autoencoder structure variance matrices N(µ,σ2I), in which case the recognition network learns to encode µ and σ for each latent variable. While the AE is capable of untangling the latent factors This simple parameterization allows for the KL-divergence to of complex data one limitation of this architecture is that be computed analytically without the need for sampling. 
B. Variational autoencoders

The variational autoencoder (VAE) [7], [22] is a directed graphical model composed of generator and recognition networks (Figure 2). The goal of the recognition network is to learn an approximation to the intractable posterior p_θ(z|x) by using an approximate inference network q_φ(z|x) through some nonlinear mapping, typically parametrized as a feedforward neural network. The generator network takes an estimate of z and learns how to generate samples from p_θ(x|z), such that p_θ(x) = Σ_z p(x|z)p(z) approximates the data distribution p(x).

[Fig. 2: Variational AE generator and recognition networks. (a) Generator network. (b) Recognition network.]

These networks can be composed in a fashion similar to the classical AE, where the recognition network forms a kind of encoder, and the generator network constitutes the decoder. This is, in fact, the proposed method of training them within the SGVB framework [7], which adds only the complexity of sampling to classical methods, and still performs updates using the backpropagation algorithm.

The objective of the VAE shares some similarities with that of the classical AE; that is, in the case of continuous data, the optimization of a squared error objective along with an additional regularizing term. This objective is denoted as the variational lower bound on the marginal likelihood:

  \log p_\theta(x) \ge -D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]    (1)

where D_{KL} is the KL-divergence between the encoding and prior distributions, analogous to a regularization term in a standard autoencoder. The form of p_θ(x|z) will depend on the nature of the data, but it is typically Bernoulli or Gaussian. Note that the VAE formulation only permits continuous latent variables.

Commonly, both the prior p_θ(z) and the encoding distributions are chosen to be multivariate Gaussians with diagonal covariance matrices N(µ, σ²I), in which case the recognition network learns to encode µ and σ for each latent variable. This simple parameterization allows the KL-divergence to be computed analytically without the need for sampling. In order to compute the expectation, the reparameterization trick introduced by Kingma and Welling [7] reparameterizes the (non-differentiable) z through some differentiable function g_φ(x, ε):

  z = \mu + \sigma\epsilon    (2)

where ε is sampled from the noise distribution p(ε) = N(0, I), and µ and σ are the mean and standard deviation of the encoding distribution, respectively. Thus, an estimate of the lower bound can be computed according to:

  \mathcal{L}_{VAE}(\theta, \phi; x) = -D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x|z^{(l)})    (3)

where z is reparameterized according to Equation 2, and L is the number of samples drawn for computing the expectation.¹

¹ In many cases (such as large minibatch sizes), only a single sample needs to be drawn.

Somewhat abstracted from view, but fundamental to the decision to pursue them in this work, is that VAEs are not only probabilistic models, but that the stochasticity of z allows for modeling multiple modes within the data distribution. That is, sampling different z's may localize the network's predictions to different high-probability regions of the reconstructed output space. In our setup, we only condition on visual input and sample grasps, not vice-versa; therefore we do not explicitly treat the many-to-many mapping. However, VAEs may also permit the joint modeling of grasp, vision, and other perceptual inputs, such as tactile sensors. This is out of application scope and reserved for future work.
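The reparameterization in Equation 2 and the bound estimate in Equation 3 can be written in a few lines. The sketch below is a minimal NumPy illustration for a Gaussian q_φ(z|x), a standard normal prior, and a Gaussian decoder; the encoder/decoder outputs (mu, log_sigma, the linear "decoder") are placeholders standing in for real network outputs.

```python
import numpy as np

rng = np.random.RandomState(0)
latent_dim, x_dim, L = 2, 32, 10

x = rng.randn(x_dim)                      # one datapoint (placeholder)
mu = rng.randn(latent_dim) * 0.1          # encoder mean (placeholder)
log_sigma = rng.randn(latent_dim) * 0.01  # encoder log std-dev (placeholder)
sigma = np.exp(log_sigma)
W_dec = np.ones((latent_dim, x_dim)) * 0.05  # placeholder "decoder" weights

# Analytic KL between N(mu, sigma^2 I) and the N(0, I) prior.
kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)

log_lik = 0.0
for _ in range(L):
    eps = rng.randn(latent_dim)
    z = mu + sigma * eps                  # Equation 2: z = mu + sigma * eps
    x_hat = z @ W_dec                     # stand-in for the mean of p_theta(x|z)
    # Gaussian log-likelihood with unit variance (up to an additive constant).
    log_lik += -0.5 * np.sum((x - x_hat) ** 2)
log_lik /= L

lower_bound = -kl + log_lik               # Equation 3
print("estimated lower bound:", lower_bound)
```

Because the randomness is carried by ε, gradients with respect to µ and σ can flow through z, which is what lets SGVB train the recognition network with ordinary backpropagation.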
C. Conditional variational autoencoders

The conditional variational autoencoder (CVAE) [27] is an extension of the VAE architecture to settings with more than one source of information. Given an input x, output y, and latent variables z, conditional distributions are used for the recognition network q_φ(z|x, y), generator network p_θ(y|x, z), and for a prior network p_θ(z|x).² In this work, x represents input visual information (i.e. images), and y is the output grasp configuration, as shown in Figure 3.

² The prior network is a technique for modulating the latent variables via some input data, and is used in place of the prior specified for the VAE.

[Fig. 3: Schematic of our method. (a) The CVAE architecture used in our experiments. During training, the recognition network p(z|x, y) learns an encoding distribution for the image and contact positions/normals, while the generator network p(y|x, z) takes a sample z, along with a representation from the prediction network, to generate a sample grasp. The prior network p(z|x) is used to regularize the distribution learned by the recognition network (via the KL-divergence term of Equation 4), and is also used to sample a z during testing, as the network does not then have access to grasp information. In the R-CVAE network, the structure of the prior network matches that of the recognition network, but takes a predicted grasp (made by the prediction network) as input. Dotted arrows denote components used only during training, dashed arrows components used only during testing, and solid arrows components used for both. (b) CNN module for processing visual information with Network-in-Network (NIN) layers [12]; all visual recognition modules are CNNs with NIN.]

The lower bound of the CVAE is similar to that of the VAE, except for the addition of a second source of given information:

  \mathcal{L}_{CVAE}(\theta, \phi; x, y) = -D_{KL}(q_\phi(z|x, y) \,\|\, p_\theta(z|x)) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(y|x, z^{(l)})    (4)

The structure of each of these networks (recognition, generator, and prior) is a design choice, but following [27], we design both our recognition and generator networks to have a convolutional (CNN-based) pathway from the image for its numerous benefits, among them reducing the number of free parameters. The typical way to evaluate such models is through estimates of the conditional log-likelihood (CLL), using either Monte Carlo (Equation 5) or importance sampling (Equation 6), the latter typically requiring fewer samples:

  p_\theta(y|x) \approx \frac{1}{S}\sum_{s=1}^{S} p_\theta(y|x, z^{(s)}), \qquad z^{(s)} \sim p_\theta(z|x)    (5)

  p_\theta(y|x) \approx \frac{1}{S}\sum_{s=1}^{S} \frac{p_\theta(y|x, z^{(s)})\, p_\theta(z^{(s)}|x)}{q_\phi(z^{(s)}|x, y)}, \qquad z^{(s)} \sim q_\phi(z|x, y)    (6)
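As a rough illustration of Equations 5 and 6, the sketch below estimates the CLL for a single (x, y) pair given the three conditional densities as callables. The Gaussian densities here are toy placeholders standing in for the trained networks; only the estimator structure follows the equations above.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.RandomState(0)
latent_dim, S = 2, 100

# Placeholder distributions standing in for the trained networks.
prior_mu = np.zeros(latent_dim)            # mean of p_theta(z|x)
recog_mu = np.array([0.5, -0.3])           # mean of q_phi(z|x,y)

def log_p_y_given_xz(z):
    # Toy stand-in for log p_theta(y|x,z) as a function of the latent sample.
    return mvn.logpdf(z, mean=np.array([0.4, -0.2]), cov=0.5)

# Equation 5: Monte Carlo estimate, sampling z from the prior network.
z_prior = rng.randn(S, latent_dim) + prior_mu
cll_mc = np.log(np.mean(np.exp([log_p_y_given_xz(z) for z in z_prior])))

# Equation 6: importance-sampled estimate, sampling z from the recognition network.
z_recog = rng.randn(S, latent_dim) + recog_mu
log_w = np.array([log_p_y_given_xz(z)
                  + mvn.logpdf(z, mean=prior_mu, cov=1.0)   # p_theta(z|x)
                  - mvn.logpdf(z, mean=recog_mu, cov=1.0)   # q_phi(z|x,y)
                  for z in z_recog])
cll_is = np.log(np.mean(np.exp(log_w)))

print("Monte Carlo CLL:", cll_mc, "importance-sampled CLL:", cll_is)
```

In the experiments reported below, 100 importance samples (CVAE, R-CVAE) and 1000 Monte Carlo samples (GSNN) were found sufficient for consistent estimates.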
D. Grasp motor image

Consider, for example, transferring prior motor experience to novel objects, where the selection of an initial grasp was influenced by internal and external perceptual object properties. One approach to transferring this knowledge could be to directly use past experience – existing in some high-dimensional space – to initialize priors within a system. A different approach, e.g. in MI, could be to access some latent, low-dimensional representation of past actions which is shared among a variety of different situations.

Instead of learning a direct, image-to-grasp mapping through neural networks, we instead learn an (image-and-grasp)-to-grasp mapping. Our approach is intuitive: based on perceptual information about an object, and an idea of how an object was previously grasped, we index a shared structure of object-grasp pairs to synthesize new grasps to apply to an object. As shown in Section III-C, this scheme exists as a single module and is trainable in a fully end-to-end manner. Further, the structure of the CVAE model allows us to model and generate grasps belonging to not only one, but possibly many different modes.

Building the grasp motor image only requires that object properties are captured in some meaningful way. We use a single data modality (i.e. visual information), which exploits CNN modules for efficient and effective visual representation. Our continued interest lies in investigating how the GMI can be gradually learned through multiple modalities; specifically, those that capture internal properties and require object interaction (such as force and tactile data).

IV. EXPERIMENTAL SETUP

There are a few different directions that reasoning about grasp synthesis using GMI affords. Due to abstractions at both the object and action level, we hypothesize that the model should require fewer training samples, and evaluate this by restricting the amount of training data available to the model. We also evaluate synthesized grasps on objects similar to those seen at training time, along with families of objects the model has never seen before.

A. Dataset

We collected a dataset of successful, cylindrical precision robotic grasps using the V-REP simulator [23], and object files provided by Kleinhans et al. [8], on a simulated "picking" task. While there are several three-fingered grippers being employed for commercial and research applications (e.g. the ReFlex hand, http://labs.righthandrobotics.com), a number of recent works have adopted the Barrett hand [6], [8].³ In this work, we favour this gripper for its ability to capture a variety of different grasps dependent on the shape of an object.

³ Note that the Barrett hand is an underactuated, three-fingered gripper parameterized through a total of 8 joints and 4 degrees of freedom.

The object files comprise a set of various everyday household items, ranging in size from small objects such as tongs, to larger objects such as towels and vases. Each of these object classes has a unique ancestral template, and perturbations of these templates were performed to yield a number of meshes with different shape characteristics. During data collection we assumed that all objects are non-deformable, share the same friction value, and share the same mass of 1 kg. In a simulated setting, these assumptions allow us to direct our attention towards the effects of visual information on the GMI and forgo properties that are better captured through, e.g., tactile sensory systems.

From each object class, we remove one object and place it into a test set. The training set is comprised of 161 objects from the 20 most populated classes, containing around 47,000 successful image/grasp pairs. From this set, we randomly select 10% to be used for validation during training. In order to study the effects of generating grasps for novel objects, we partition the test set into two distinct groups: objects that are similar to those of the training set (i.e. belonging to the same class of object but not the same instance), and objects from classes never encountered during training (different). The final dataset statistics are reported in Table I.

TABLE I: Overview of dataset statistics

                              #Objects    #Instances
  Training files                   161        42,351
  Testing files - Similar           20         4,848
  Testing files - Different         53         8,851

1) Visual information: Recorded for each successful grasp trial are RGB and depth images (size 64×64 pixels), as well as a binary segmentation mask of equal size, indicating the object's spatial location in the image. Each image is collected using a simulated Kinect camera with its primary axis pointing towards the object (along the negative z-direction of the manipulator's palm), and the y-axis pointing upwards. This configuration means that the same image could correspond to a number of different grasps, and allows us to capture multimodality that may exist within the grasp space.

2) Definition of grasps: Our experiments leverage the three-fingered Barrett hand, and define a grasp as the 18-dimensional vector [p_1, p_2, p_3, n_1, n_2, n_3], where the subscript denotes the finger, and the contact positions p and normals n each have (x, y, z) Cartesian components. While the contact positions specify where to place the fingertips of a gripper, the purpose of the contact normals is to describe the relative orientation. Note that with this parameterization, the unique characteristics of a manipulator (i.e. number of joints or degrees of freedom) have been abstracted into the number of available fingertips, rendering the representation gripper-agnostic.

We encode all grasps in the object's reference frame {O}, which is obtained by applying PCA to the binary segmentation mask. We assign the directional vectors O_z and O_y coincident with the first and second principal components respectively, and ensure that O_x always points into the object. We define the object's centroid as the mean x_p and y_p pixel coordinates of the object.
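A rough sketch of the image-plane part of this frame construction is given below: PCA on the mask's foreground pixel coordinates yields the centroid and the two principal directions used for O_z and O_y. How these 2-D directions are lifted into the 3-D object frame, and the sign convention that makes O_x point into the object, are details not reproduced here.

```python
import numpy as np

def object_frame_from_mask(mask):
    """Estimate centroid and principal axes of a binary segmentation mask.

    mask: (H, W) array with 1 where the object is visible, 0 elsewhere.
    Returns the pixel centroid and two unit vectors (first and second
    principal components of the foreground pixel coordinates).
    """
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys], axis=1).astype(float)   # (N, 2) pixel coords
    centroid = coords.mean(axis=0)                       # mean x_p, y_p
    cov = np.cov((coords - centroid).T)                  # 2x2 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending eigenvalues
    first_pc = eigvecs[:, -1]    # candidate direction for O_z
    second_pc = eigvecs[:, -2]   # candidate direction for O_y
    return centroid, first_pc, second_pc

# Example: a synthetic elongated blob standing in for a segmented object.
mask = np.zeros((64, 64))
mask[20:44, 28:36] = 1
centroid, o_z, o_y = object_frame_from_mask(mask)
print(centroid, o_z, o_y)
```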
B. Learning

We build our networks within the Lasagne framework [1], using the Adam optimizer, a learning rate of 0.001, and a minibatch size of 100. Due to class imbalance in the dataset (complicated by some objects being easier to grasp than others), we train with class-balanced minibatches. We standardize all data to have zero mean and unit variance.

For the recognition network we use a 5-layer CNN, applying max pooling every 2 layers and using filter sizes of [7, 5, 5, 5, 3] and numbers of filters [16, 32, 32, 64, 64], respectively. The output of the convolution operations feeds into a network-in-network structure with average pooling [12] for reducing the dimensionality of the system, and is subsequently followed by a 64-neuron hidden layer. The outputs from processing the images and grasps are fused into a shared representation, then followed with another 64-neuron hidden layer before encoding the latent representation. As a regularizer, we inject white noise at all network inputs with σ = 0.05 and apply weight normalization [24].

As mentioned, in order to compute p_θ(y|x, z), we employ a prediction network for making an initial guess (i.e. p_θ(y|x)), and add it to a prediction of p_θ(y|z) from the generator network using a sampled z (Figure 3a). Our prior network follows a similar structure to the recognition network; in the CVAE model, we drop the input y and only process the visual stream. In the recurrent CVAE (R-CVAE), we take an initial guess made by the prediction network and feed it back into the corresponding input in the prior network.

Two other models were tested: the Gaussian stochastic neural network (GSNN) [27], which is derived by setting the recognition and prior networks equal (i.e. q_φ(z|x, y) = p_θ(z|x)), and a baseline convolutional neural network (CNN) that takes input images and tries to predict a grasp configuration directly. For evaluating the CLL, we found that 100 samples for importance sampling (R-CVAE and CVAE), and 1000 samples for Monte Carlo estimates with the GSNN, were sufficient to obtain consistent estimates.
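As an illustration of the visual pathway described above, here is a sketch of one CNN module in Lasagne (the framework reported in the paper). Layer choices mirror the text — five convolutions with filter sizes [7, 5, 5, 5, 3] and filter counts [16, 32, 32, 64, 64], max pooling every two layers, a NIN block with global average pooling, and a 64-unit dense layer — but padding, nonlinearities, and other hyperparameters are assumptions, not taken from the paper.

```python
import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            NINLayer, GlobalPoolLayer, DenseLayer)
from lasagne.nonlinearities import rectify

def build_cnn_module(input_var=None):
    # 4-channel RGB-D input, 64x64 pixels.
    net = InputLayer(shape=(None, 4, 64, 64), input_var=input_var)
    # Five conv layers; max pooling after every second convolution.
    net = Conv2DLayer(net, num_filters=16, filter_size=7, nonlinearity=rectify)
    net = Conv2DLayer(net, num_filters=32, filter_size=5, nonlinearity=rectify)
    net = MaxPool2DLayer(net, pool_size=2)
    net = Conv2DLayer(net, num_filters=32, filter_size=5, nonlinearity=rectify)
    net = Conv2DLayer(net, num_filters=64, filter_size=5, nonlinearity=rectify)
    net = MaxPool2DLayer(net, pool_size=2)
    net = Conv2DLayer(net, num_filters=64, filter_size=3, nonlinearity=rectify)
    # Network-in-network block followed by global average pooling.
    net = NINLayer(net, num_units=64, nonlinearity=rectify)
    net = GlobalPoolLayer(net)
    # 64-neuron fully connected layer producing the module's representation.
    net = DenseLayer(net, num_units=64, nonlinearity=rectify)
    return net

cnn_module = build_cnn_module()
print(lasagne.layers.get_output_shape(cnn_module))  # (None, 64)
```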
V. RESULTS

To emphasize the fact that the networks are learning distributions over the data instead of deterministic predictions, Figure 4 presents a sample histogram of the latent and output variable space for a single object-grasp instance.

[Fig. 4: Histograms of output (normalized) and latent-space samples for a single object-grasp instance: (a) learned output distributions, (b) learned latent distributions. Histograms generated by drawing 4000 samples from the R-CVAE network.]

A. Conditional log-likelihood

Table II presents the estimated negative CLL for each of the model types, split between the two experimental paradigms: test sets composed of similar or different objects relative to the training set. The CLL scores for the CVAE and R-CVAE show that they significantly outperform the GSNN and CNN, indicating a tighter lower bound and better approximation to log p(x). This result could be due to the parameterization of the prior distribution; in the CVAE and R-CVAE models, the prior was modulated by the use of a prior network, allowing the predicted means flexibility in shifting their inputs. The GSNN, on the other hand, uses a recognition network that only has information about the input visual information x, and is unable to shift the prior mean based on the grasps y.

TABLE II: Negative CLL for test sets composed of similar or different objects (relative to the training set).

              Similar objects (n=4,848)       Different objects (n=8,851)
  Train size  16,384    32,768    42,351      16,384    32,768    42,351
  CNN         24.721    24.833    24.577      26.910    26.920    26.599
  GSNN        22.827    22.292    21.831      27.945    32.513    33.461
  CVAE        15.325    13.531    13.216      18.356    18.808    16.525
  R-CVAE      13.670    13.024    12.511      14.277    14.128    13.514

B. Simulator

To evaluate how the network predictions transfer back to the "picking" task, we evaluate a single prediction (again using the distribution means for the stochastic networks) in the simulator. Given that the task is to pick up an object, we define a grasp to be successful if the object is still held within the gripper at the height of the picking operation. If the object is no longer held by the fingertips, is contacting other components, or the gripper failed to contact the object during initial finger placement, the grasp is deemed a failure.

In order to position the manipulator for each initial grasp, our grasp planner calculates an optimal initial wrist placement by minimizing the distance of each of the manipulator's fingers to the predicted target positions (see the sketch at the end of this subsection):

  \min_{\alpha, \beta, \gamma, T_x, T_y, T_z} \sum_{i=1}^{N} (C_i - Y_i)^2    (7)

where α, β, γ are the yaw, pitch, and roll rotational components, while T_x, T_y, T_z are the x, y, and z translational components. In this optimization, N is the number of fingertips, and C_i, Y_i are the ground-truth and predicted fingertip positions relative to a common frame. The results for similar and different objects can be seen in Table III.

TABLE III: Percentage of successful simulated grasps.

              Similar objects (n=4,848)       Different objects (n=8,851)
  Train size  16,384    32,768    42,351      16,384    32,768    42,351
  CNN         0.155     0.202     0.199       0.106     0.138     0.145
  GSNN        0.169     0.177     0.190       0.115     0.145     0.147
  CVAE        0.344     0.346     0.347       0.302     0.301     0.295
  R-CVAE      0.318     0.323     0.362       0.288     0.304     0.315

Two key results can be seen from this table. First, grasp predictions made by the baseline CNN appear unable to match those of the generative CVAE and R-CVAE models. It is likely that this result stems from learning a discriminative model in a multimodal setting. Partial success for the CNN's predictions may also be due in part to a grasp planner that succeeds under fairly weak predictions.

Second, with respect to data efficiency, the relative gains for the baseline CNN model (between the 16k and 42k training set sizes) appear to be much greater than for the generative CVAE and R-CVAE models. The literature has reported advances in supervised learning that have been able to leverage very large labeled training datasets, but the gains for unsupervised learning methods are less documented.
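A rough sketch of the wrist-placement optimization in Equation 7 is shown below, using SciPy's general-purpose optimizer. The nominal fingertip offsets of the hand in the wrist frame and the Euler-angle convention are illustrative assumptions; the paper does not specify how the minimization is solved.

```python
import numpy as np
from scipy.optimize import minimize

def rotation(alpha, beta, gamma):
    """Rotation matrix from yaw (alpha), pitch (beta), roll (gamma)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
    return Rz @ Ry @ Rx

# Assumed nominal fingertip positions in the wrist frame (three fingertips).
FINGERTIPS_WRIST = np.array([[0.05, 0.04, 0.10],
                             [0.05, -0.04, 0.10],
                             [-0.05, 0.00, 0.10]])

def wrist_cost(params, predicted_tips):
    """Sum of squared distances between hand fingertips and predicted tips (Eq. 7)."""
    alpha, beta, gamma, tx, ty, tz = params
    R, t = rotation(alpha, beta, gamma), np.array([tx, ty, tz])
    current_tips = FINGERTIPS_WRIST @ R.T + t
    return np.sum((current_tips - predicted_tips) ** 2)

# predicted_tips stands in for the network's predicted contact positions Y_i.
predicted_tips = np.array([[0.30, 0.14, 0.25],
                           [0.30, 0.06, 0.25],
                           [0.20, 0.10, 0.25]])
result = minimize(wrist_cost, x0=np.zeros(6), args=(predicted_tips,),
                  method="Nelder-Mead")
print(result.x)  # optimized [alpha, beta, gamma, Tx, Ty, Tz]
```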
C. Multimodal grasp distributions

In Figures 5 and 6, we demonstrate the learned multimodality of our networks by sampling different grasps from a single object-grasp instance.⁴ In these figures, we present the input RGB image (left), the plotted grasp configuration and object (middle), as well as a t-SNE [13] plot of the learned grasp space (right). t-SNE is a method for visualizing high-dimensional data by projecting it down to a lower-dimensional space. In these plots, one can clearly see distributions with multiple modes, which in turn appear to be reflected in the plotted grasp space.

⁴ The plotted grasp configuration is the result of solving for the grasp that maximizes y* = \arg\max_y \frac{1}{L}\sum_{l=1}^{L} p_\theta(y|x, z^{(l)}), using z^{(l)} ∼ p_θ(z|x) and L = 50.
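A plot like the right-hand panels of Figures 5 and 6 can be produced by projecting a batch of sampled 18-dimensional grasp vectors to 2-D with t-SNE and overlaying a kernel density estimate. The sketch below uses scikit-learn and SciPy, which the paper does not mention; the random matrix stands in for grasps sampled from the trained generator.

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.stats import gaussian_kde

rng = np.random.RandomState(0)

# Stand-in for grasps sampled from the generator for one input image:
# two artificial modes in the 18-dimensional grasp space.
grasps = np.vstack([rng.randn(200, 18) + 2.0,
                    rng.randn(200, 18) - 2.0])

# Project the grasp samples to 2-D for visualization.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(grasps)

# Kernel density estimate over the embedded points (for the contour overlay).
kde = gaussian_kde(embedding.T)
densities = kde(embedding.T)
print(embedding.shape, densities.shape)  # (400, 2) (400,)
```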
VI. DISCUSSION

Further inspection of the grasp space in Figures 5 and 6 appears to show that many of the grasps share a similar vertical distribution. We believe this may be a result of the data collection process, where reachability constraints prevent certain grasps from being executed (e.g. on the lower part of objects due to contact with the table).

We have identified a few potential limitations of our current approach. First, our method still requires tens of thousands of examples to train, which is expensive to collect in the real world. Second, our evaluation strategy has only focused on objects with fixed intrinsic properties, which is a simplification of real-world characteristics. Compared to feed-forward networks, a sampling-based approach is more computationally expensive, and there may be alternate ways of simplifying its computational requirements. Finally, there are also other practical considerations that could be taken during data collection, such as optimizing the closing strategy of the manipulator for, e.g., more reactive grasping.

VII. CONCLUSION

In this work we presented a conceptual framework for robotic grasping, the grasp motor image, which integrates perceptual information and grasp configurations using deep generative models. Applying our method to a simulated grasping task, we demonstrated the capacity of these models to transfer learned knowledge to novel objects under varying amounts of available training data, as well as their strength in capturing multimodal data distributions.

Our primary interest moving forward is in investigating how objects with variable intrinsic properties can be learned with the GMI, specifically by introducing additional sensory modalities into the system. We are also interested in investigating how continuous contact with an object contributes to the formation of the GMI. We find work within the cognitive sciences on the effects of somatosensory input on motor imagery (cf. [15]) to be an interesting starting point.

[Fig. 5: Sampled grasps from similar objects. Left: RGB image of the object. Middle: plotted grasp configurations (positions and normals). Right: t-SNE plot of the learned grasp space with superimposed kernel density estimate (note multiple discrete modes).]

[Fig. 6: Sampled grasps from different objects. Left: RGB image of the object. Middle: plotted grasp configurations (positions and normals). Right: t-SNE plot of the learned grasp space with superimposed kernel density estimate (note multiple discrete modes).]
REFERENCES

[1] S. Dieleman et al., "Lasagne: First release," August 2015. [Online]. Available: http://dx.doi.org/10.5281/zenodo.27878
[2] C. Finn et al., "Deep spatial autoencoders for visuomotor learning," in IEEE International Conference on Robotics and Automation, May 2016.
[3] V. Frak, Y. Paulignan, and M. Jeannerod, "Orientation of the opposition axis in mentally simulated grasping," Experimental Brain Research, vol. 136, no. 1, pp. 120–127, 2001.
[4] T. Hanakawa, M. A. Dimyan, and M. Hallett, "Motor planning, imagery, and execution in the distributed motor network: a time-course study with functional MRI," Cerebral Cortex, vol. 18, no. 12, pp. 2775–2788, 2008.
[5] M. Jeannerod, "Neural simulation of action: a unifying mechanism for motor cognition," NeuroImage, vol. 14, no. 1, pp. S103–S109, 2001.
[6] D. Kappler, J. Bohg, and S. Schaal, "Leveraging big data for grasp planning," in IEEE International Conference on Robotics and Automation, May 2015.
[7] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proceedings of the International Conference on Learning Representations, 2014.
[8] A. Kleinhans et al., "G3DB: a database of successful and failed grasps with RGB-D images, point clouds, mesh models and gripper parameters," in International Conference on Robotics and Automation: Workshop on Robotic Grasping and Manipulation, 2015.
[9] I. Lenz, H. Lee, and A. Saxena, "Deep learning for detecting robotic grasps," The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
[10] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016.
[11] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," International Symposium on Experimental Robotics, 2016.
[12] M. Lin, Q. Chen, and S. Yan, "Network in network," in Proceedings of the International Conference on Learning Representations, 2014.
[13] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[14] J. Mahler et al., "Dex-Net 1.0: A cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards," in IEEE International Conference on Robotics and Automation, May 2016.
[15] N. Mizuguchi, T. Yamagishi, H. Nakata, and K. Kanosue, "The effect of somatosensory input on motor imagery depends upon motor imagery capability," Frontiers in Psychology, vol. 6, p. 104, 2015.
[16] L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor, "Learning object affordances: From sensory–motor coordination to imitation," IEEE Transactions on Robotics, vol. 24, no. 1, pp. 15–26, Feb 2008.
[17] T. Mulder, "Motor imagery and action observation: cognitive tools for rehabilitation," Journal of Neural Transmission, vol. 114, no. 10, pp. 1265–1278, 2007.
[18] J. Munzert, B. Lorey, and K. Zentgraf, "Cognitive motor processes: the role of motor imagery in the study of motor representations," Brain Research Reviews, vol. 60, no. 2, pp. 306–326, 2009.
[19] K. Noda, H. Arie, Y. Suga, and T. Ogata, "Multimodal integration learning of robot behavior using deep neural networks," Robotics and Autonomous Systems, vol. 62, no. 6, pp. 721–736, 2014.
[20] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," in IEEE International Conference on Robotics and Automation, May 2016.
[21] F. T. Pokorny, K. Hang, and D. Kragic, "Grasp moduli spaces," in Robotics: Science and Systems, 2013.
[22] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in Proceedings of the 31st International Conference on Machine Learning, 2014.
[23] E. Rohmer, S. P. N. Singh, and M. Freese, "V-REP: A versatile and scalable robot simulation framework," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013.
[24] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, 2016.
[25] J. Sergeant, N. Sünderhauf, M. Milford, and B. Upcroft, "Multimodal deep autoencoders for control of a mobile robot," in Australasian Conference for Robotics and Automation, 2015.
[26] N. Sharma and J.-C. Baron, "Does motor imagery share neural networks with executed movement: a multivariate fMRI analysis," Frontiers in Human Neuroscience, vol. 7, p. 564, 2013.
[27] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," in Advances in Neural Information Processing Systems, 2015.
[28] D. Song, C. H. Ek, K. Huebner, and D. Kragic, "Multivariate discretization for Bayesian network structure learning in robot grasping," in IEEE International Conference on Robotics and Automation, May 2011.
[29] J. Sung, I. Lenz, and A. Saxena, "Deep multimodal embedding: Manipulating novel objects with point-clouds, language and trajectories," CoRR, vol. abs/1509.07831, 2015. [Online]. Available: http://arxiv.org/abs/1509.07831
[30] Y. Uno, N. Fukumura, R. Suzuki, and M. Kawato, "Integration of visual and somatosensory information for preshaping hand in grasping movements," in Advances in Neural Information Processing Systems, 1993.
