Learning Articulated Motions From Visual Demonstration

by

Sudeep Pillai

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science and Engineering
at the
Massachusetts Institute of Technology

June 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, April 30, 2014

Certified by: Seth Teller, Professor, Thesis Supervisor

Accepted by: Leslie Kolodziejski, Chairman, Department Committee on Graduate Students

To my parents

Abstract

Robots operating autonomously in household environments must be capable of interacting with articulated objects on a daily basis. They should be able to infer each object's underlying kinematic linkages purely by observing its motion during manipulation. This work proposes a framework that enables robots to learn the articulation in objects from user-provided demonstrations, using RGB-D sensors. We introduce algorithms that combine concepts in sparse feature tracking, motion segmentation, object pose estimation, and articulation learning to develop our proposed framework. Additionally, our methods can predict the motion of previously seen articulated objects in future encounters. We present experiments that demonstrate the ability of our method, given RGB-D data, to identify, analyze, and predict the articulation of a number of everyday objects within a human-occupied environment.

Thesis Supervisor: Seth Teller
Title: Professor

Acknowledgments

I would like to thank my advisor, Seth Teller, for the countless sessions of advice and ideas we have had. His perspective on tackling more challenging, real-world problems has constantly pushed me to dream bigger. I would especially like to thank Matt Walter and Sachi Hemachandra for their invaluable feedback and collaboration on several aspects of my research. Many thanks to Mike Fleder, Jon Brookshire, David Hayden, Ross Finman, Nick Wang, and William Li for their thoughts and ideas over several discussions.

I would also especially like to thank my parents for the unconditional love and endless support they have provided throughout my life. I thank my brothers and their wives for being a constant source of inspiration and support. To my two nephews and soon-to-be-born niece: you bring a huge smile to my face every time I think of you; I couldn't ask for more. Finally, I would like to thank my friends; being surrounded by their compassion and laughter has made my experience here at MIT all the more enjoyable.

Contents

1 Introduction  18
  1.1 Thesis Overview  20
    1.1.1 Learning from Visual Demonstration  20
    1.1.2 Implementation  21
    1.1.3 Experiments and Analysis  21
  1.2 Contributions  21

2 Related Work  24
  2.1 Background  24
3 Articulation Learning from Visual Demonstration  30
  3.1 Spatio-Temporal Feature Tracking  32
    3.1.1 Trajectories via Interest-Point Tracking and Learning  33
  3.2 Motion Segmentation  42
    3.2.1 Trajectory Matching  42
    3.2.2 Trajectory Clustering  45
  3.3 Multi-Rigid-Body Pose Optimization  50
    3.3.1 Notation  50
    3.3.2 Pose Estimation  51
    3.3.3 Pose Optimization  52
  3.4 Articulation Learning  54
    3.4.1 Notation  54
    3.4.2 Problem Definition  54
    3.4.3 Candidate Models and Model Fitting  55
    3.4.4 Structure Selection  56
  3.5 Learning to Predict Articulation  58
    3.5.1 Object Instance Recognition and Prediction  58
    3.5.2 Qualitative Results  61

4 Experiments and Analysis  65
  4.1 Data and Experimental Setup  65
  4.2 Algorithm Evaluation  65
    4.2.1 Learning with Simulated Data  66
    4.2.2 Learning with RGB-D Sensors  76
    4.2.3 Articulation Prediction  86

5 Conclusion  90

A Implementation  92

List of Figures

1-1 The proposed framework reliably learns the underlying kinematic model of articulated objects from user-provided visual demonstrations, and subsequently predicts their motions at future encounters, from novel vantage points.  19

3-1 The architecture of our system for articulation learning from visual demonstration.  31

3-2 The training phase.  32

3-3 The prediction phase.  32

3-4 Spatio-Temporal Feature Tracking steps combine traditional dense optical flow methods with feature detection and matching techniques to construct long-range trajectories with little to no drift.  33

3-5 Trajectories constructed using KLT [29] (left), Dense Trajectories [34] (middle), and our algorithm (right). While all three algorithms show reasonably good trajectory construction, our algorithm exhibits negligible drift while achieving longer trajectories.  34

3-6 Illustration of feature detection and prediction for a checkerboard pattern undergoing motion. (a) Feature detections (via GFTT) and feature predictions (via Dense Optical Flow) are indicated by blue points and blue edges, respectively. (b) Gray points indicate the feature detections in the subsequent frame. Gray circles indicate the regions where feature predictions continue to be valid.  36