DTIC ADA459921: A Framework for Learning Declarative Structure


A Framework for Learning Declarative Structure

Stephen Hart, Shichao Ou, John Sweeney, and Rod Grupen
Department of Computer Science, University of Massachusetts, Amherst, MA 01003 USA
{shart, chao, sweeney, [email protected]}

Abstract: This paper provides a framework with which a humanoid robot can efficiently learn complex behavior. In this framework, a robot is rewarded by learning how to generate novel sensorimotor feedback, a form of native motivation. This intrinsic drive biases the robot to learn increasingly complex knowledge about itself and its effect on the environment. The framework includes a mechanism for uncovering hidden state in a well-structured state and action space. We present an example wherein the robot, Dexter, learns a sequence of manual skills: (1) searching for and grasping an object, (2) the length of its arms, and (3) how to portray its intentions to human teachers in order to induce them to help.

I. INTRODUCTION

Behavior can be decomposed into declarative and procedural components. The procedural structure captures knowledge about how to implement an abstract policy in a particular setting; e.g., use the left hand and an enveloping grasp [1]. The declarative structure captures abstract knowledge about the task; e.g., to pick up an object, we must first find the object, reach to it, and then grasp it. In this paper, we present a framework for a robot to autonomously learn the declarative structure.

Fig. 1. Dexter, the UMass bi-manual humanoid.

The robot is motivated to explore new behavior in order to cause stimuli on multiple channels of sensorimotor feedback. This formulation provides the robot with native reward for transforming exterior phenomena into new patterns of interior stimulation. As a result, the robot accumulates knowledge regarding itself and the environment using a representation grounded in patterns of sensory and motor signals.

In many cases, the value of actions is obscured by hidden state in the sensorimotor feedback. We present an algorithm to resolve hidden state by finding sensorimotor variables that reduce entropy in the distribution of probable outcomes produced by action.

In this paper, we employ a platform for experimental research in bi-manual manipulation called "Dexter" to explore the development of behavior. Dexter has a 2-DOF pan/tilt head equipped with two Sony cameras capable of stereo triangulation and two 7-DOF whole-arm manipulators (Barrett Technologies, Cambridge MA). Each arm is equipped with a 3-finger, 4-DOF Barrett Hand and 6-axis finger-tip load cells.

II. THE CONTROL BASIS

The architecture presented in this paper employs the control basis framework for discrete event dynamic systems [2], [3]. This approach is designed to provide a combinatoric basis for control that supports the representation of declarative and procedural knowledge. Primitive actions are closed-loop controllers that consist of an objective function φ ∈ Ω_φ, a subset of available "sensor" resource abstractions σ ∈ Ω_σ, and a subset of available "effector" resource abstractions τ ∈ Ω_τ. A fully specified controller with its sensor and effector resources is denoted φ_τ^σ. The sequence of objective functions invoked captures the declarative structure of a task, while the σ and τ parameters capture procedural details appropriate for the run-time context.

In the control basis, concurrent control commands are constructed by projecting the output of a subordinate controller, φ2, into the nullspace of a higher priority controller, φ1. (Unless otherwise noted, we use φ_i to represent a particular instantiation of a controller with defined objective, sensor, and effector resources.) The nullspace N1 of the control command of φ1 is (I - J1^+ J1), where J_i is the Jacobian matrix of the objective with respect to the configuration variables θ, and J_i^+ is its pseudoinverse [4]. Our shorthand for this nullspace projection is written using the "subject-to" operator "/" [2]. For example, φ2 / φ1 captures the case where the subordinate controller φ2 projects its outputs into the nullspace of the superior controller φ1.
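To make the composition concrete, the following minimal sketch shows how a subordinate command can be projected into the nullspace of a superior controller. It is an illustration under stated assumptions, not the authors' implementation: the Jacobian, command vectors, and dimensions are hypothetical placeholders.

```python
import numpy as np

def subject_to(u2, u1, J1):
    """Combine two commands as phi2 / phi1 ("phi2 subject to phi1").

    u1, u2 : configuration-space commands from the superior (phi1) and
             subordinate (phi2) controllers.
    J1     : Jacobian of phi1's objective with respect to the configuration
             variables theta.
    The subordinate command is filtered through the nullspace projector
    N1 = I - J1^+ J1 so it cannot disturb the higher-priority objective.
    """
    n = J1.shape[1]
    N1 = np.eye(n) - np.linalg.pinv(J1) @ J1
    return u1 + N1 @ u2

# Hypothetical usage: a 7-DOF arm with a 3-dimensional primary objective.
J1 = np.random.randn(3, 7)   # placeholder Jacobian for phi1
u1 = np.random.randn(7)      # placeholder command from phi1
u2 = np.random.randn(7)      # placeholder command from phi2
u = subject_to(u2, u1, J1)   # combined command sent to the shared effectors
```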
A. Discrete Event Dynamic Systems

The control basis also provides for a discrete state representation that reflects the status of underlying control tasks. Closed-loop controllers provide asymptotically stable behavior that is robust to local perturbations. The error dynamics of a controller, (e, ė), support a natural discrete abstraction of the continuous underlying state space. According to this discrete abstraction, control events are defined by recognizable temporal patterns in the control error.

The dynamic state of a controller φ_i can be characterized by a predicate p_i. In this paper we define three cases:

        -1 : e_i is undefined
  p_i =  0 : ė_i < ε_φ                                     (1)
         1 : ė_i ≥ ε_φ

where ε_φ is some small negative constant. A value of X for predicate p_i characterizes an aggregate "don't care" value over the three possible values and is often useful for state abstraction. For asymptotically stable controllers, for which e(t) can be considered a Lyapunov function, ė is negative definite.
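A direct reading of Eq. (1) is sketched below; the convention of representing an undefined error as None and the particular threshold value are illustrative assumptions, not details given in the paper.

```python
def controller_predicate(e, e_dot, eps=-1e-3):
    """Discrete dynamic state of a controller per Eq. (1).

    e     : current error value, or None if the objective is undefined
            (assumed convention for "e_i is undefined").
    e_dot : time derivative of the error.
    eps   : small negative constant; since e_dot is negative definite for a
            converging controller, e_dot >= eps signals quiescence.
    Returns -1 (undefined), 0 (still descending), or 1 (quiescent).
    """
    if e is None:
        return -1
    return 0 if e_dot < eps else 1
```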
A "schema" designates a set of possible actions and a policy for selecting actions on the basis of patterns in the feedback that they reveal. Two such schemas will be used in the experiments to follow: SEARCH-TRACK and GRASP.

Fig. 2. Two simple sensorimotor schemas constructed from primitive closed-loop controllers: (a) shows the SEARCH-TRACK schema that finds and tracks a visual feature using controllers φ0 and φ1; (b) shows the GRASP schema that moves the fingers of a hand to a closed position using controllers φ2 and φ3. If an object is present, the fingers will envelop that object and take action φ3 to grasp it.

SEARCH-TRACK is a schema constructed out of two primitive, closed-loop controllers. Position controller φ0 moves a pan/tilt stereo head to a random configuration drawn from a context sensitive probability distribution. Another position controller, φ1, moves a visual feature to the image center. Both of these objectives are addressed by actuating the pan/tilt motors of the stereo head. Together, they can be used to construct a variety of behavior. States S = (p0, p1) and actions φ0 and φ1 define a Markov Decision Process (MDP). A control policy can be defined over this MDP that finds and tracks a visual feature. Figure 2(a) shows the relevant states and non-zero state transitions for this schema. This schema results in a policy which moves the head to random pan/tilt locations by executing φ0 until a feature is found. When this event occurs, the feature is tracked in the center of the image plane. The transition from state (1, 0) to state (1, -1) captures the situation in which the visual feature "outruns" the tracking controller.

Figure 2(b) shows a GRASP schema that employs actions φ2 and φ3 activated on states S = (p2, p3). Position controller φ2 flexes the fingers of the robot hand toward a reference "closed" configuration. Force controller φ3 applies a reference force at each fingertip contact so as to generate a wrench closure condition that defines a grasp. Assuming that the hand begins in the "open" configuration, Figure 2(b) specifies that the hand close through the execution of φ2. In the course of closing, φ3 will either assert state (1, -1) if there is no object to inhibit the movement, state (1, 0) if contact with an object is made but the force condition is not met, or state (1, 1) if an object is encountered and the force condition is achieved. Under the right conditions, the GRASP schema yields grasps that allow the hand to move the target object. Note that states (1, -1) and (1, 1) are absorbing in this policy because there are no transitions out of them.
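To show how a schema reads its predicate state and dispatches a primitive controller, here is a minimal, assumption-laden sketch of a SEARCH-TRACK-style policy; it encodes only the behavior described in the text, not the full transition structure of Figure 2(a).

```python
def search_track_policy(p0, p1):
    """Pick the next primitive action from the predicate state (p0, p1).

    p0, p1 are the dynamic-state predicates of phi0 (random search) and
    phi1 (feature tracking), each in {-1, 0, 1}. Assumed reading: p1 == -1
    means no visual feature is currently defined, so keep searching;
    otherwise servo the feature toward the image center.
    """
    if p1 == -1:
        return "phi0"   # move the pan/tilt head to a new random configuration
    return "phi1"       # a feature is in view: track it to the image center
```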
B. Reward Structure

To guide the behavior of the robot, an intrinsic reward function is used to encourage multi-modal events in the robot's sensorimotor feedback. Equipped with this function, a robot will be biased toward behavior that causes the environment to change in discernible ways. In general, this function rewards a robot for bringing about control and sensory events that signify the status of useful objectives.

For the experiments presented in this paper, a simple function is used pertaining to only a few possible sensory cues: a positive unit reward is given to the robot when new foreground regions appear in front of fixed backgrounds (a form of motion cue) and also when tactile events occur on the robot's finger-tip sensors. A penalty of -0.2 was given for each action taken, to decrease the number of steps taken in each trial.

C. Hierarchical Composition of Behavior

Behavioral schemas like those defined in Figures 2(a) and 2(b) can be reused hierarchically. We will use the notation Φ to represent a policy used as an action. For each behavior Φ_i, let O_i ⊆ S_i be the set of absorbing states, and A_i be the action set of available controllers and schemas. When in state s, the dynamic state of Φ_i is represented as a predicate p_i, according to the following three cases:

        -1 : e_j is undefined for all φ_j ∈ A_i
  p_i =  0 : s ∉ O_i                                       (2)
         1 : s ∈ O_i

Adding hierarchy in this way allows for the transfer of behavior across tasks as well as an increased efficiency in learning. Coarticulation techniques for executing behavior schemas concurrently with other schemas or primitive controllers have been developed [5].

D. Learning Control Sequences

Reinforcement learning techniques such as Q-learning [6] can be employed to choose actions that allow an agent to cause rewarding transitions from one state to another [2], [7]. Q-learning estimates the action-value function Q(s, a) for each state-action pair (s, a), where s ∈ S and a ∈ A. Values are estimated by calculating the expected sum of discounted reward for each state-action pair. The update rule is defined as:

  Q(s, a) ← Q(s, a) + α [ r(s, a) + γ max_{a'} Q(s', a') - Q(s, a) ]

where γ ∈ [0, 1] is the discount rate and α ∈ [0, 1] is the learning rate. The reward function r is described in Section II-B. With sufficient experience, this estimate is guaranteed to converge to the true value Q*. A policy π maps states to actions. The optimal policy π* maximizes the expected sum of future reward:

  π*(s) = argmax_a Q*(s, a)
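A minimal tabular rendering of this update is sketched below; the learning-rate and discount values are illustrative (the paper does not report them), and the ε-greedy exploration shown here uses the ε = 0.1 value mentioned in Section III.

```python
import random
from collections import defaultdict

class TabularQ:
    """Minimal tabular Q-learning over factored predicate states (a sketch)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(float)   # Q[(state, action)] -> estimated value
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, s):
        """Epsilon-greedy action selection in state s."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, r, s_next):
        """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(self.Q[(s_next, a2)] for a2 in self.actions)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.Q[(s, a)])
```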
E. Uncovering Hidden State in the MDP

In general, it is not possible to know a priori all the state information required to make optimal, deterministic control decisions for a task. However, in many cases hidden state is encoded in the observable sensorimotor features available to a robot. If we assume that the uncertainty in executing actions on the robot is small, then the main cause of stochasticity in the state transitions is due to factors that are not captured in the current state description. Therefore, an increase in the entropy of the state transitions indicates that the environment has changed in a way that is unmodeled in the current state representation. By attending to features that reduce entropy, and incorporating them into the state representation, the robot's actions become more deterministic. In this section, we provide a learning algorithm that monitors the entropy of the transition probability distribution and inspects the available feature space of sensorimotor data to uncover hidden state. When relevant information is found, a predicate corresponding to the dynamic state of a controller that tracks that information is added to the state representation.

Let S = (p0, p1, ..., pn) be the factored state representation of control predicates. Let F ∈ ℱ be a feature vector of available sensorimotor data. Let the goal state set G ≡ {s ∈ S s.t. r(s, a) > 0}. We can assign a label l ∈ {+, ∘} to an episode according to whether it was rewarding or not, as follows. If the agent reaches an absorbing state that is also a goal state s ∈ G, we assign the label l = +. If the agent reaches an absorbing state s ∈ O that is not a goal state, we assign the label l = ∘. A general outline of the process used at each learning stage to uncover hidden state is given in Algorithm 1.

Algorithm 1: Uncovering Hidden State
  1: Use Q-learning to find the optimal policy π* given the current state representation.
  2: Use π* until it is determined that the entropy of a particular s × a transition distribution is high.
  3: Using stochastic exploration with π*, complete a number of trials and gather a data set D, where each d ∈ D is a tuple (F, l).
  4: Using D, calculate a decision boundary, g, in feature space ℱ. Add a new controller φ_{n+1} that attends to the features of the decision boundary g(F). Append the corresponding predicate p_{n+1} to S to form the new factored state representation.
  5: Go to step 1.
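The sketch below renders steps 2-4 of Algorithm 1 schematically. The entropy test and the linear SVM are stand-ins chosen for concreteness (the paper reports using an SVM [8] in Stage 2); the function names, data layout, and labels-to-integers mapping are hypothetical.

```python
import numpy as np
from sklearn import svm

def transition_entropy(next_state_counts):
    """Empirical entropy (bits) of one s x a transition distribution (step 2)."""
    p = np.asarray(next_state_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def fit_decision_boundary(episodes):
    """Steps 3-4: from a data set of (feature vector, label) tuples, fit a
    boundary g whose output can back the new predicate p_{n+1}.

    episodes : list of (F, l) pairs with l in {'+', 'o'}, gathered while
               exploring stochastically around the current policy pi*.
    """
    X = np.array([f for f, _ in episodes])
    y = np.array([1 if l == '+' else 0 for _, l in episodes])
    g = svm.SVC(kernel="linear").fit(X, y)   # decision boundary in feature space
    return g                                 # queried by a new controller phi_{n+1}
```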
III. LEARNING TRAJECTORIES

In this section, we present experimental results that show Dexter learning to search for, reach toward, and touch objects. At first, a human teacher presents an object to Dexter that is always within reach. In this setting, Dexter quickly learns a policy that will touch the object. Later, when the object is sometimes placed out of reach, Dexter learns to avoid the penalties associated with actions by not reaching when the target is out of reach. Finally, Dexter learns that it can often "request" that the human teacher bring the object within its reach by executing certain actions with communicative content.

A. Dexter's Native Structure

In the following experiments, the object presented to Dexter is a small toy orange basketball, seen with Dexter in Figure 1, approximately 5 inches in diameter and made out of soft foam. A simple enveloping grasp reliably acquires the basketball.

During each trial, Dexter acquires a feature vector F ∈ ℱ of observable sensory signals. Background subtraction is used to segment foreground and background regions in Dexter's field of view. Foreground regions indicate motion relative to a fixed background. This kind of feature is determined to be attractive by design; it is a native, motion-based saliency cue that bootstraps Algorithm 1. ℱ contains the first and second moments of the foreground pixel distributions, the spatial location of the foreground objects relative to Dexter's coordinate frame, and the tactile response of each of Dexter's 6-axis finger-tip load cells.

We provided the robot with five native controllers: φ0, φ1, φ2, and φ3 from Section II-A, instantiated with the robot's pan/tilt head and hands, and position controller φ4, which achieves a "Reach" behavior implemented with one of the robot's arms. We also provide the robot with the two schemas, SEARCH-TRACK and GRASP, presented in Figure 2.

We also provided Dexter with (previously learned) procedural details for allocating controller resources σ and τ [1]. When the object is placed on the robot's left side, the left arm is used to reach; when the object is placed on the robot's right side, the right arm is used. It is the goal of the following experiments to learn declarative knowledge: how to choose actions that will reliably lead to the most accumulated reward.

B. Stage 1: Search-Reach-Grasp Behavior

The robot is provided with the action set A = {SEARCH-TRACK, GRASP, φ4}. Providing the two schemas, rather than the corresponding primitive controllers, allows for encoding useful prior knowledge of the task. We will use a three-predicate state description S = (p_st, p_g, p_4), corresponding to the dynamic status of SEARCH-TRACK, GRASP, and φ4. These three predicates provide a preliminary representation intended to guide the robot toward grasping behavior.

In the first experiment, consisting of 30 episodes, the robot learns a policy to acquire the basketball using Q-learning with an ε-greedy exploration strategy, where ε = 0.1. As described in Section II-B, a reward signal of 1 was given to the robot both when the object was found visually and when it was touched. A penalty of -0.2 was also given for each action taken. As seen in Figure 3, it takes the robot approximately ten trials to learn the optimal policy π* = SEARCH-TRACK → "Reach" → GRASP, which corresponds to searching for the object, reaching to it, and grasping it. In this figure, the total accumulated reward for each episode is shown as well as the "task" reward (the "goal" event of grasping the object, plus the penalty). Episodes 30-35 show a testing phase in which exploration was turned off.

Fig. 3. Accumulated reward for Dexter learning the search-reach-grasp policy, averaged over 4 runs. The solid line shows the return for the touch reward and the penalty. The dotted line shows the total return for the search reward, the touch reward, and the penalty. Dexter learns π* in about ten episodes.
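Connecting Stage 1 to the Q-learning sketch in Section II-D, the learner could be wired up roughly as follows; the reward bookkeeping restates the values given in the text, while the interface itself is hypothetical.

```python
# Hypothetical Stage 1 wiring, reusing the TabularQ sketch from Section II-D:
# the state is the predicate tuple (p_st, p_g, p_4); epsilon matches the
# reported value of 0.1.
agent = TabularQ(actions=["SEARCH-TRACK", "GRASP", "phi4"], epsilon=0.1)

def step_reward(found_visually, touched):
    """Per-action reward: -0.2 per action, +1 for a new foreground region
    (visual discovery), +1 for a tactile event (touch)."""
    r = -0.2
    if found_visually:
        r += 1.0
    if touched:
        r += 1.0
    return r
```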
C. Stage 2: Learning Arm Length

Once a policy was acquired for grasping the object, ten episodes (35-45) were performed in which, half the time, the object was placed out of Dexter's reach, a context that had never occurred before. Figure 4 shows how, using the policy learned in stage 1, the average reward drops to about half its former value. During episodes in which Dexter could not reach the object, the negative penalty for taking actions still accumulated, but no final payoff for completing the task was received. However, Dexter still gathers reward for finding the object visually.

Fig. 4. Accumulated rewards for the first two stages of learning. During stage 2, the object is placed outside of Dexter's workspace 50% of the time. When the predicate p5 is added, the robot learns not to reach if the object is too far away. As a result, average reward increases.

Although position information is present in the sensory feedback, the robot is not attending to it. The current state representation, therefore, does not capture information pertaining to the "reachability" of the object. However, by computing a new decision boundary in ℱ, this state can be recovered and added to the state description, thus allowing the robot to learn not to reach if the object is unreachable. To uncover this information, the {+, ∘} labels of episodes 35-45 were used to separate ℱ using a support vector machine (SVM) [8].

Figure 5 shows the xy position of the object for these episodes, showing the clear cut at x = 1.09 meters. Given this information, a new position controller φ5 can be constructed in which the position reference is the set of robot configurations where rewarding "touches" are forthcoming, i.e., x < 1.09 m. With this addition, the robot is free to query the controller's state to establish the "reachable" states and modify its policy accordingly. The conjunction of the assertions (p_st, p5) now resolves important hidden state about the task.

Fig. 5. The xy positions of the object for the trials seen in Figure 4, episodes 35-45. The decision boundary at x = 1.09 m best separates the {+, ∘} classes. This data set was generated by constraining the robot to use only its right arm. The learned boundary was transferred, through symmetry, to cases in which the procedural context dictated that Dexter use its left arm.

Figure 4 shows how Dexter, after 30 more episodes (45-75), learns that if the object is too far away, not reaching is more rewarding on average. Episodes 75-85 show a testing phase in which exploration is turned off. If the object is reachable, Dexter will touch the object for a reward. But when the object is out of reach, it has learned not to incur the penalty from a failed reach attempt. However, Dexter is unable, given the less constrained context, to achieve as much reward, on average, as was possible in stage 1.
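The hidden state recovered in this stage reduces to a threshold on the object's x position; a minimal sketch of the resulting predicate is shown below. The 1.09 m cut is the boundary reported in the paper for the right arm; the function itself is an illustrative simplification of querying controller φ5's state.

```python
def p5_reachable(x_object_m, boundary_m=1.09):
    """Assumed reading of the new predicate p5: asserts (returns 1) when the
    object's x position, in Dexter's coordinate frame, lies on the side of the
    learned boundary where rewarding "touches" are forthcoming."""
    return 1 if x_object_m < boundary_m else 0
```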
D. Stage 3: Learning When Humans Can Help

In the next stage of learning, seen in Figure 6, a human teacher appears in Dexter's field of view during each episode, and the characteristics of the resulting foreground segment are added to the feature vector F. In episodes 85-115, Dexter explores stochastic variations of the stage 2 policy and discovers a new context in which it can create a contingency for touching an object that is currently out of reach: when Dexter reaches toward the closest out-of-reach object, that action conveys the robot's intent to the human. This new context triggers a search in Algorithm 1, finding that the presence of a second, larger foreground object (as seen in Figure 7) often distinguishes the case where unreachable objects are ultimately graspable.

Fig. 7. The foreground objects' second moment (covariance) when the human is present. These values represent the first two principal components of the pixel distributions on the left image plane. The distinct clusters correspond to the basketball and the nearby human.

A new position controller φ6 is configured to test the assertion that a large foreground object is present. In episodes 115-145, Dexter learns a new policy with the predicate p6 added to the state description. Here, another unit of reward is received when Dexter finds a second foreground segment. This policy differs from the policy learned in stage 2 in that it causes Dexter to reach for unreachable objects in the presence of features associated with a nearby human, but not otherwise. For episodes 145-155, exploration is turned off to show the noticeable accumulation of more average reward than in stage 2. Note that, in stage 3, Dexter learns a policy that, on average, accumulates more reward than in stage 1. However, Dexter does not achieve the level of "goal" reward received under the constrained conditions of the first stage, when the object was always placed within Dexter's reach.

Fig. 6. Accumulated reward for all three stages of learning. When the predicate p6 is added to the state representation, Dexter can exploit the presence of features associated with humans to achieve more reward.

IV. DISCUSSION AND CONCLUSIONS

We have presented a framework that provides a control theoretic basis for action, the hierarchical composition of behavior, an intrinsic drive towards skill acquisition, and a means for uncovering hidden state. Promising preliminary results have shown how Dexter achieves significant developmental milestones operating under this framework, including primitive communicative actions, but many questions pertaining to the full expressive power of the methodology remain unanswered.

It is possible that Algorithm 1 may introduce bias into the system as the teacher "directs" the robot's attention to particular sensorimotor features. However, rather than being a hindrance, we believe that this approach allows the teacher to convey meaningful information to the robot, while still allowing the robot to discern exactly how this information is related to its sensorimotor signals.

For future work, we wish to examine the tractability of learning declarative and procedural structure simultaneously. We also wish to explore, in more detail, both the autonomous acquisition of hierarchical behavior and the formulation of multi-modal reward signals. A study of the convergence properties of the overall system remains to be made.

ACKNOWLEDGMENTS

This research was supported under contract numbers ARO W911NF-05-1-0396 and NASA NNJ05HB61A-5710001842, and under NASA Graduate Student Research Program fellowship NNJ05JG73H.

REFERENCES

[1] S. Hart, R. Grupen, and D. Jensen, "A relational representation for procedural task knowledge," in Proceedings of the American Association for Artificial Intelligence Conference, 2005.
[2] M. Huber, "A hybrid architecture for adaptive robot control," Ph.D. dissertation, Department of Computer Science, University of Massachusetts Amherst, 2000.
[3] J. A. Coelho, "Multifingered grasping: Grasp reflexes and control context," Ph.D. dissertation, Department of Computer Science, University of Massachusetts Amherst, 2001.
[4] Y. Nakamura, Advanced Robotics: Redundancy and Optimization. Addison-Wesley, 1991.
[5] K. Rohanimanesh and S. Mahadevan, "Coarticulation: An approach for generating concurrent plans in Markov decision processes," in Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 2005.
[6] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[7] R. Sutton and A. Barto, Reinforcement Learning. Cambridge, MA: MIT Press, 1998.
[8] T. Joachims, "Learning to classify text using support vector machines," Ph.D. dissertation, Department of Computer Science, Cornell University, 2002.
