
Interactive Learning from Policy-Dependent Human Feedback

James MacGlashan∗, Mark K Ho†, Robert Loftin‡, Bei Peng§, David Roberts‡, Matthew E. Taylor§, Michael L. Littman†
∗Cogitai, †Brown University, ‡North Carolina State University, §Washington State University

Abstract: For agents and robots to become more useful, they must be able to quickly learn from non-technical users. This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner's current policy. We present empirical results that show this assumption to be false: whether human trainers give positive or negative feedback for a decision is influenced by the learner's current policy. We argue that policy-dependent feedback, in addition to being commonplace, enables useful training strategies from which agents should benefit. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot, even with noisy image features.

I. INTRODUCTION

Programming robots is very difficult, in part because the real world is inherently rich and, to some degree, unpredictable. In addition, our expectations for physical agents are quite high and often difficult to articulate. Nevertheless, for robots to have a significant impact on the lives of individuals, even non-programmers need to be able to specify and customize behavior. Because of these complexities, relying on end-users to provide instructions to robots programmatically seems destined to fail.

Reinforcement learning (RL) from human trainer feedback provides a compelling alternative to programming because agents can learn complex behavior from very simple positive and negative signals. Furthermore, real-world animal training is an existence proof that people can train complex behavior using these simple signals. Indeed, animals have been successfully trained to guide the blind, locate mines in the ocean, detect cancer or explosives, and even solve complex, multi-stage puzzles.

However, traditional reinforcement-learning algorithms have yielded limited success when the reward signal is provided by humans and have largely failed to benefit from the sophisticated training strategies that expert animal trainers use with animals. This failure has led to the development of new reinforcement-learning algorithms that are designed to learn from human-generated rewards, and to investigations into how people give interactive feedback [1, 2, 3, 4, 5, 6]. In general, many of these human-centered learning algorithms are built on the insight that people tend to give feedback that reflects the policy the agent should be following, rather than a numeric value that is meant to be maximized by the agent. While this insight seems accurate, existing approaches assume models of feedback that are independent of the policy the agent is currently following. We present empirical results that demonstrate that this assumption is incorrect and further argue that policy-dependent feedback enables effective training strategies, such as differential feedback and policy shaping, from which we would like a learning agent to benefit. Following this result, we present Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent human feedback that is capable of benefiting from these strategies. COACH is based on the insight that the TD error used by actor-critic algorithms is an unbiased estimate of the advantage function, which is a policy-dependent value roughly corresponding to how much better or worse an action is compared to the current policy, and which captures these previously mentioned training strategies. To validate that COACH scales to complex problems, we train five different behaviors on a TurtleBot robot that makes decisions every 33 ms from noisy image features that are not visible to the trainer.
II. BACKGROUND

For modeling the underlying decision-making problem of an agent being taught by a human, we adopt the Markov Decision Process (MDP) formalism. An MDP is a 5-tuple ⟨S, A, T, R, γ⟩, where S is the set of possible states of the environment; A is the set of actions available to the agent; T(s′|s,a) is the transition function, which defines the probability of the environment transitioning to state s′ when the agent takes action a in environment state s; R(s,a,s′) is the reward function, specifying the numeric reward the agent receives for taking action a in state s and transitioning to state s′; and γ ∈ [0,1] is a discount factor specifying how much immediate rewards are preferred to more distant rewards.

A stochastic policy π for an MDP is a per-state action probability distribution that defines a mode of behavior; π : S × A → [0,1], where Σ_{a∈A} π(s,a) = 1 for all s ∈ S. In the MDP setting, the goal is to find the optimal policy π∗, which maximizes the expected future discounted reward when the agent selects actions in each state according to π∗; that is, π∗ = argmax_π E[Σ_{t=0}^∞ γ^t r_t | π], where r_t is the reward received at time t. Two important concepts in MDPs are the value function (V^π) and the action–value function (Q^π). The value function defines the expected future discounted reward from each state when following some policy, and the action–value function defines the expected future discounted reward when an agent takes some action in some state and then follows some policy π thereafter. These functions can be defined recursively via the Bellman equations: V^π(s) = Σ_a π(s,a) Q^π(s,a) and Q^π(s,a) = Σ_{s′} T(s′|s,a)[R(s,a,s′) + γ V^π(s′)]. For shorthand, the value functions for the optimal policies are usually denoted V∗ and Q∗.

In reinforcement learning (RL), an agent interacts with an environment modeled as an MDP, but does not have direct access to the transition function or reward function and instead must learn a policy from environment observations. A common class of RL algorithms is the actor-critic family, for which Bhatnagar et al. [7] provide a general template. Actor-critic algorithms are named for their two main components: the actor is a parameterized policy that dictates how the agent selects actions; the critic estimates the value function for the actor and provides critiques at each time step that are used to update the policy parameters. Typically, the critique is the temporal difference (TD) error δ_t = r_t + γV(s_t) − V(s_{t−1}), which describes how much better or worse a transition went than expected.
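To make the actor-critic critique concrete, the following Python sketch shows a minimal tabular version of the idea: the critic's TD error drives both the value estimate and the softmax actor. The state/action counts, learning rates, and tabular representation are illustrative choices, not details from this work.

    import numpy as np

    # Minimal tabular actor-critic sketch (illustrative only): the TD error
    # delta = r + gamma * V(s') - V(s) is the critique used to update the actor.
    n_states, n_actions = 5, 3
    gamma, alpha_v, alpha_pi = 0.95, 0.1, 0.05

    V = np.zeros(n_states)                      # critic: state-value estimates
    theta = np.zeros((n_states, n_actions))     # actor: softmax policy parameters

    def policy(s):
        prefs = theta[s] - theta[s].max()
        p = np.exp(prefs)
        return p / p.sum()

    def actor_critic_step(s, a, r, s_next):
        """One actor-critic update driven by the TD-error critique."""
        delta = r + gamma * V[s_next] - V[s]    # better/worse than expected?
        V[s] += alpha_v * delta                 # critic update
        grad = -policy(s)                       # d log pi(a|s) / d theta[s, :]
        grad[a] += 1.0
        theta[s] += alpha_pi * delta * grad     # actor update along policy gradient
        return delta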
Fig. 1. The training interface shown to AMT users.

III. HUMAN-CENTERED REINFORCEMENT LEARNING

In this work, a human-centered reinforcement-learning (HCRL) problem is a learning problem in which an agent is situated in an environment described by an MDP, but in which rewards are generated by a human trainer instead of a stationary MDP reward function that the agent is meant to maximize. The trainer has a target policy π∗ they are trying to teach the agent. The trainer communicates this policy by giving numeric feedback as the agent acts in the environment. The goal of the agent is to learn the target policy π∗ from the feedback.

To define a learning algorithm for this problem, we first characterize how human trainers typically use numeric feedback to teach target policies. If feedback is stationary and intended to be maximized, it can be treated as a reward function and standard RL algorithms can be used. Although this approach has had some success [8, 9], there are complications that limit its applicability. In particular, a trainer must take care that the feedback they give contains no unanticipated exploits, constraining the feedback strategies they can use. Indeed, prior research has shown that interpreting human feedback like a reward function often induces positive reward cycles that lead to unintended behaviors [10, 11].

The issues with interpreting feedback as reward have led to the insight that human feedback is better interpreted as a comment on the agent's behavior; for example, positive feedback roughly corresponds to "that was good" and negative feedback roughly corresponds to "that was bad." Existing HCRL work adopting this perspective includes TAMER [1], SABL [6], and Policy Shaping [5], discussed in more detail in the Related Work section. We note, though, that all of these approaches assume that human feedback is independent of the agent's current policy. We provide empirical results that show this assumption to be incorrect.

IV. POLICY-DEPENDENT FEEDBACK

Evidence that human feedback is influenced by the agent's current policy can be seen in previous work. For example, it has been observed that trainers taper their feedback over the course of learning [11, 12, 9]. One explanation for decreasing feedback rates is policy-dependent feedback, but trainer fatigue is another. We provide a stronger result showing that trainers choose positive or negative feedback for the same state–action pair depending on their perception of the learner's behavior. This finding serves as a warning for algorithms that rely on an assumption of policy-independent feedback.

A. Empirical Results

We had Amazon Mechanical Turk (AMT) participants teach an agent in a simple sequential task, illustrated in Figure 1. Participants were instructed to train a virtual dog to walk to the yellow goal location in a grid world as fast as possible, but without going through the green cells. They were additionally told that, as a result of prior training, their dog was already either "bad", "alright", or "good" at the task and were shown examples of each behavior before training. In all cases, the dog would start in the location shown in Figure 1. "Bad" dogs walked straight through the green cells to the yellow cell. "Alright" dogs first moved left, then up, and then to the goal, avoiding green but not taking the shortest route. "Good" dogs took the shortest path to yellow without going through green.

During training, participants saw the dog take an action from one tile to another and then gave feedback after every action using a continuous labeled slider. The slider always started in the middle of the scale on each trial, and several points were labeled with different levels of reward (praise and treats) and punishment (scolding and a mild electric shock). Participants went through a brief tutorial using this interface. Responses were coded as a numeric value from −50 to 50, with "Do Nothing" as the zero point.

During the training phase, participants trained a dog for three episodes that all started in the same position and ended at the goal. The dog's behavior was pre-programmed for all episodes in such a way that the first step of the final episode would reveal whether feedback was policy-dependent. The dog always performed the same behavior in the first two episodes, and then performed the "alright" behavior in the third episode. Each user was placed into one of three different conditions that determined how the dog would behave in the first two episodes: either "bad," "alright," or "good." If feedback is policy dependent, we expect more positive feedback in the "bad" condition than in the "alright" or "good" conditions; "alright" behavior is an improvement over the previous "bad" behavior, whereas it is either no improvement or a deterioration compared to "alright" or "good" behavior.

Figure 2 shows boxplots and individual responses for the first step of the final episode under each of the three conditions. These results indicate that the sign of feedback is sensitive to the learner's policy, as predicted. The mean and median feedback under the "bad" condition is slightly positive (Mean = 9.8, Median = 24, S.D. = 22.2; planned Wilcoxon one-sided signed-rank test: Z = 1.71, p < 0.05), whereas it is negative for the "alright" condition (Mean = −18.3, Median = −23.5, S.D. = 24.6; planned Wilcoxon two-sided signed-rank test: Z = −3.15, p < 0.01) and the "good" condition (Mean = −10.8, Median = −18.0, S.D. = 20.7; planned Wilcoxon one-sided signed-rank test: Z = −2.33, p < 0.05). There was a main effect across the three conditions (p < 0.01, Kruskal-Wallis test), and pairwise comparisons indicated that only the "bad" condition differed from the "alright" and "good" conditions (p < 0.01 for both, Bonferroni-corrected, Mann-Whitney pairwise tests).

Fig. 2. Boxplots with median, interquartile range, and minimum/maximum values of human responses to the first step of the final episode for each condition. Feedback tended to be positive when the prior behavior was bad and negative otherwise.
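As a rough illustration of this kind of analysis, the following Python sketch runs the same family of nonparametric tests with SciPy on placeholder response arrays; the numbers below are synthetic stand-ins on the −50 to 50 scale, not the study's data.

    import numpy as np
    from scipy import stats

    # Synthetic stand-in responses, one array per condition (NOT the study data).
    bad     = np.array([30, 25, -5, 40, 10, 22, -8, 35])
    alright = np.array([-20, -30, -10, -25, 5, -35, -15, -28])
    good    = np.array([-15, -5, -25, -20, 10, -30, -12, -18])

    # One-sample Wilcoxon signed-rank tests against a zero median.
    print(stats.wilcoxon(bad, alternative="greater"))    # positive in "bad"?
    print(stats.wilcoxon(alright))                       # two-sided
    print(stats.wilcoxon(good, alternative="less"))      # negative in "good"?

    # Main effect across conditions, then pairwise follow-ups.
    print(stats.kruskal(bad, alright, good))
    print(stats.mannwhitneyu(bad, alright, alternative="two-sided"))
    print(stats.mannwhitneyu(bad, good, alternative="two-sided"))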
B. Training Strategies

Beyond the fact that the evidence above suggests that people give policy-dependent feedback, we argue that policy-dependent feedback affords desirable training strategies. Specifically, we consider three different feedback schemes that can be viewed as operationalizations of well-studied behavior-analysis reinforcement schedules [13]:

Diminishing Returns: gradually decrease positive feedback for good actions as the agent adopts those actions.

Differential Feedback: vary the magnitude of feedback with respect to the degree of improvement or deterioration in behavior.

Policy Shaping: provide positive feedback for suboptimal actions that improve behavior, and then negative feedback after the improvement has been made.

Diminishing returns is a useful strategy because it decreases the burden of how actively a trainer must supply feedback and removes the need for explicit training and execution phases. Differential feedback is useful because it can serve to highlight the most important behaviors in the state space and communicate a kind of urgency for learning them. Finally, policy shaping concerns feedback that signals an improvement relative to the current baseline, as in the AMT study above. It is useful for providing a direction for the learner to follow at all times, even when the space of policies is continuous or otherwise impractical to search.
V. CONVERGENT ACTOR-CRITIC BY HUMANS

In this section, we introduce Convergent Actor-Critic by Humans (COACH), an actor-critic-based algorithm capable of learning from policy-dependent feedback. COACH is based on the insight that the advantage function is a good model of human feedback and that actor-critic algorithms update a policy using the critic's TD error, which is an unbiased estimate of the advantage function. Consequently, an agent's policy can be directly modified by human feedback without a critic component. We first define the advantage function and describe how it relates to the three previously mentioned training strategies. Then, we present the general update rule for COACH and its convergence. Finally, we present Real-time COACH, which includes mechanisms for providing variable-magnitude feedback and for learning in problems with a high-frequency decision cycle.

A. The Advantage Function and Training Strategies

The advantage function [14], A^π, is defined as

A^π(s,a) = Q^π(s,a) − V^π(s).   (1)

Roughly speaking, the advantage function describes how much better or worse an action selection is compared to the agent's performance when following policy π. We now show that feedback assigned according to the advantage function follows the patterns of all three training strategies. Showing that the advantage function captures the differential feedback strategy is trivial, because the advantage function is defined by how much better taking an action is expected to be than following the current policy.

To show that the advantage function induces diminishing returns, consider what happens when the learner improves its behavior by shifting to a higher-scoring action a in some state s. As its probability of selecting a goes to 1, A^π(s,a) = Q^π(s,a) − Q^π(s,a) = 0, because the value function is V^π(s) = Σ_a π(s,a) Q^π(s,a). Since this expected value is a smooth linear combination of the Q-values, as the agent adopts action a, A^π(s,a) → 0, resulting in gradually decreasing feedback.

To show that the advantage function induces policy shaping, let us first assume, without loss of generality, that there are three possible actions, where action a_3 is exclusively optimal and action a_2 is better than action a_1; that is, Q∗(s,a_3) > Q∗(s,a_2) > Q∗(s,a_1). For illustrative purposes, let us also assume that the agent has learned the optimal policy in all states except state s, where the agent with near-certain probability selects the worst action a_1 (π(s,a_1) ≈ 1), and that all actions lead to some other state. In this scenario, A^π(s,a) = Q∗(s,a) − Σ_{a′} π(s,a′) Q∗(s,a′) ≈ Q∗(s,a) − Q∗(s,a_1). Since Q∗(s,a_2) > Q∗(s,a_1), it follows that A^π(s,a_2) is positive. Consequently, in this condition, suboptimal action a_2 would receive positive feedback.

Now consider the case in which the agent with near-certainty selects the optimal action a_3. Under this condition, A^π(s,a) ≈ Q∗(s,a) − Q∗(s,a_3), and since Q∗(s,a_3) > Q∗(s,a_2), A^π(s,a_2) is negative. Once again, because V^π is a smooth linear function of Q^π, as the agent adopts optimal action a_3, A^π(s,a_2) becomes negative. Therefore, suboptimal actions can be rewarded and then punished as the agent improves at the task, producing policy shaping.
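A small numeric sketch makes the two arguments above concrete. The Q-values for the three actions below are arbitrary assumptions (not from the paper); the two policies mimic the "mostly a_1" and "mostly a_3" cases and show the sign flip for a_2 and the shrinking advantage of an adopted action.

    import numpy as np

    Q = np.array([0.0, 0.5, 1.0])   # assumed values for a1, a2, a3 in state s

    def advantage(pi):
        # A(s, a) = Q(s, a) - sum_a' pi(a') Q(s, a')
        return Q - pi @ Q

    # Learner mostly picks the worst action a1: the merely-better action a2
    # has positive advantage, so it would receive positive feedback.
    print(advantage(np.array([0.9, 0.05, 0.05])))   # A(a2) = 0.425 > 0

    # Learner mostly picks the optimal action a3: a2's advantage is now
    # negative, and A(a3) shrinks toward 0 (diminishing returns).
    print(advantage(np.array([0.05, 0.05, 0.9])))   # A(a2) < 0, A(a3) = 0.075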
B. Convergence and Update Rule

Given a performance metric ρ, Sutton et al. [15] derive a policy gradient algorithm of the form ∆θ = α ∇_θ ρ. Here, θ represents the parameters that control the agent's behavior and α is a learning rate. Under the assumption that ρ is the discounted expected reward from a fixed start-state distribution, they show that

∇_θ ρ = Σ_s d^π(s) Σ_a ∇_θ π(s,a) Q^π(s,a),

where d^π(s) is the component of the (discounted) stationary distribution at s. A benefit of this form of the gradient is that, given that states are visited according to d^π(s) and actions are taken according to π(s,a), the update at time t can be made as

∆θ_t = α_t ∇_θ π(s_t,a_t) f_{t+1} / π(s_t,a_t),   (2)

where E[f_{t+1}] = Q^π(s_t,a_t) − v(s_t) for any action-independent function v(s).

In the context of the present paper, f_{t+1} represents the feedback provided by the trainer. It follows trivially that if the trainer chooses the policy-dependent feedback f_t = Q^π(s_t,a_t), we obtain a convergent learning algorithm that (locally) maximizes discounted expected reward. In addition, feedback of the form f_t = Q^π(s_t,a_t) − V^π(s_t) = A^π(s_t,a_t) also results in convergence. Note that for the trainer to provide feedback in the form of Q^π or A^π, they would need to "peer inside" the learner and observe its policy. In practice, the trainer estimates π by observing the agent's actions.

C. Real-time COACH

There are challenges in implementing Equation 2 for real-time use in practice. Specifically, the interface for providing variable-magnitude feedback needs to be addressed, and the question of how to handle sparseness and the timing of feedback needs to be answered. Here, we introduce Real-time COACH, shown in Algorithm 1, to address these issues.

Algorithm 1 Real-time COACH
Input: policy π_θ0, trace set Λ, delay d, learning rate α
  Initialize traces e_λ ← 0 for all λ ∈ Λ
  Observe initial state s_0
  for t = 0 to ∞ do
    select and execute action a_t ∼ π_θt(s_t, ·)
    observe next state s_{t+1}, summed feedback f_{t+1}, and trace choice λ_t
    for λ′ ∈ Λ do
      e_λ′ ← λ′ e_λ′ + (1 / π_θt(s_{t−d}, a_{t−d})) ∇_θt π_θt(s_{t−d}, a_{t−d})
    end for
    θ_{t+1} ← θ_t + α f_{t+1} e_{λt}
  end for

For providing variable-magnitude reward, we use reward aggregation [1]. In reward aggregation, a trainer selects from a discrete set of feedback values and further raises or lowers the numeric value by giving multiple feedbacks in succession, which are summed together.

While sparse feedback is not especially problematic (because no feedback results in no change in policy), it may slow down learning unless the trainer is provided with a mechanism that allows feedback to affect a history of actions. We use eligibility traces [16] to help apply feedback to the relevant transitions. An eligibility trace is a vector that keeps track of the policy gradient and decays exponentially with a parameter λ. Policy parameters are then updated in the direction of the trace, allowing feedback to affect earlier decisions. However, a trainer may not always want to influence a long history of actions. Consequently, Real-time COACH maintains multiple eligibility traces with different temporal decay rates, and the trainer chooses which eligibility trace to use. This trace choice may be handled implicitly with the feedback value selection or explicitly.

Due to reaction time, human feedback is typically delayed by about 0.2 to 0.8 seconds from the event to which the trainer meant to give feedback [10]. To handle this delay, feedback in Real-time COACH is associated with events from d steps ago to cover the gap. Eligibility traces further smooth the feedback to older events.

Finally, we note that just as there are numerous variants of actor-critic update rules, similar variations can be used in the context of COACH.
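The following Python sketch is a compact reading of Algorithm 1 under the assumption of a softmax policy that is linear in state features; the feature input, the feedback source, and the trace and learning-rate constants are illustrative placeholders rather than the paper's exact implementation.

    import numpy as np

    class RealTimeCOACH:
        def __init__(self, n_features, n_actions,
                     lambdas=(0.95, 0.9999), delay=6, alpha=0.05):
            self.theta = np.zeros((n_actions, n_features))   # policy parameters
            self.traces = {lam: np.zeros_like(self.theta) for lam in lambdas}
            self.delay = delay                                # feedback-action delay d
            self.alpha = alpha
            self.history = []                                 # recent (features, action)

        def pi(self, x):
            prefs = self.theta @ x
            p = np.exp(prefs - prefs.max())
            return p / p.sum()

        def step(self, x, feedback, lam_choice):
            """One decision cycle: act, extend all traces with the delayed
            log-policy gradient, then move theta along the chosen trace."""
            a = np.random.choice(self.theta.shape[0], p=self.pi(x))
            self.history.append((x, a))
            if len(self.history) > self.delay:
                xd, ad = self.history[-1 - self.delay]        # event feedback refers to
                p = self.pi(xd)
                # (1/pi) * grad pi equals grad log pi for a softmax-linear policy
                glog = -np.outer(p, xd)
                glog[ad] += xd
                for lam in self.traces:
                    self.traces[lam] = lam * self.traces[lam] + glog
                self.theta += self.alpha * feedback * self.traces[lam_choice]
                self.history = self.history[-(self.delay + 1):]
            return a

In use, step would be called once per decision cycle with the summed feedback received since the previous step and the trace (one of the configured λ values) implied by the trainer's feedback choice.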
VI. RELATED WORK

An inspiration for our work is the TAMER framework [10]. In TAMER, trainers provide interactive numeric feedback as the learner takes actions. The feedback is interpreted as an exemplar of the reward function for the previous state–action pair and is used to learn the reward function. When the agent makes rapid decisions, TAMER divides the feedback among the recent state–action pairs according to a probability distribution. TAMER makes decisions by myopically choosing the action with the highest reward estimate. Because the agent myopically maximizes reward, the feedback may be interpreted as exemplars of Q∗. Later work investigated non-myopically maximizing the learned reward function with a planning algorithm [17], but this approach requires a model of the environment and special treatment of termination conditions. Because TAMER expects feedback to be policy independent, it does not support the diminishing-returns or policy-shaping strategies, and it handles differential feedback only in so far as it uses numeric feedback. In our robotics case study, we provide an explicit example where TAMER's failure to support diminishing returns can result in unlearning.

Two other closely related approaches are SABL [6] and Policy Shaping [5] (unrelated to the policy-shaping feedback strategy defined above). Both of these approaches treat feedback as discrete probabilistic evidence of a parametrized policy. SABL's probabilistic model additionally includes (learnable) parameters for describing how often a trainer is expected to give explicit positive or negative feedback. Both approaches assume policy-independent feedback and do not support the three training strategies described.

There have also been some domains in which treating human feedback as reward signals to maximize has had some success, such as shaping the control of a prosthetic arm [8] and learning how to interact in an online chat room from multiple users' feedback [9]. An interesting area of future work is to test whether performance in these domains can be improved with COACH given our insights into the nature of human feedback; work on the chat-room learning domain did report challenges due to diminishing feedback.

Some research has examined combining human feedback with more traditional environmental rewards [18, 19, 20, 21]. A challenge in this context is that, in practice, rewards do not naturally come from the environment and must be programmatically defined. However, this setting is appealing because the agent can learn in the absence of an active trainer. We believe that COACH could be straightforwardly modified to learn in this setting as well.

Finally, a related research area is learning from demonstration (LfD), in which a human provides examples of the desired behavior. There are a number of different approaches to solving this problem, surveyed by Argall et al. [22]. We see these approaches as complementary to HCRL because it is not always possible, or convenient, to provide demonstrations. LfD approaches that learn a parametrized policy could also operate with COACH, allowing an agent's policy to be seeded by demonstrations and then fine-tuned with interactive feedback.
VII. ROBOTICS CASE STUDY

In this section, we present qualitative results from applying Real-time COACH to a TurtleBot robot. The goal of this study was to test whether COACH can scale to a complex domain involving multiple challenges, including training an agent that operates on a fast decision cycle (33 ms), noisy non-Markov observations from a camera, and agent perception that is hidden from the trainer. To demonstrate the flexibility of COACH, we trained it to perform five different behaviors involving a pink ball and a cylinder with an orange top, using the same parameter selections throughout. We discuss these behaviors below. We also contrast the results with training with TAMER. We chose TAMER as a comparison because, to our knowledge, it is the only HCRL algorithm with success on a similar platform [23].

The TurtleBot is a mobile base with two degrees of freedom that senses the world from a Kinect camera. We discretized the action space to five actions: forward, backward, rotate clockwise, rotate counterclockwise, and do nothing. The agent selects one of these actions every 33 ms. To deliver feedback, we used a Nintendo Wii controller to give +1, +4, or −1 numeric feedback, and to pause and continue training. For perception, we used only the RGB image channels from the Kinect. Because our behaviors were based around a relocatable pink ball and a fixed cylinder with an orange top, we hand-constructed relevant image features to be used by the learning algorithms. These features were generated using techniques similar to those used in neural network architectures. In the future, we will investigate learning these features along with the policy. The features were constructed by first transforming the image into two color channels associated with the colors of the ball and the cylinder. Sum pooling to form a lower-dimensional 8 × 8 grid was applied to each color channel. Each sum-pooling unit was then passed through three different normalized threshold units defined by T_{φ_i}(x) = min(x/φ_i, 1), where φ_i specifies how quickly the ith threshold unit saturates. Using multiple saturation parameters differentiates the distance of objects, resulting in three "depth" scales per color channel. Finally, we passed these results through a 2 × 8 max-pooling layer with stride 1.
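The following NumPy sketch illustrates one plausible reading of this feature pipeline. It assumes the ball and cylinder color masks have already been extracted from the RGB image, the saturation constants φ are arbitrary, and the 2 × 8 max-pooling window is taken to span the full 8-unit width while sliding vertically with stride 1.

    import numpy as np

    PHIS = (10.0, 50.0, 200.0)          # assumed saturation scales ("depth" levels)

    def sum_pool_8x8(channel):
        """Sum-pool a (H, W) color-mask channel down to an 8 x 8 grid."""
        h, w = channel.shape
        return channel[:h - h % 8, :w - w % 8] \
            .reshape(8, h // 8, 8, w // 8).sum(axis=(1, 3))

    def normalized_thresholds(grid):
        """Apply T_phi(x) = min(x / phi, 1) for each phi -> shape (3, 8, 8)."""
        return np.stack([np.minimum(grid / phi, 1.0) for phi in PHIS])

    def max_pool_2x8(scales):
        """2 x 8 max pooling with stride 1 over each 8 x 8 threshold map."""
        out = []
        for m in scales:                                   # m is 8 x 8
            out.append([m[r:r + 2, :].max() for r in range(7)])
        return np.array(out)                               # shape (3, 7)

    def features(ball_mask, cylinder_mask):
        """Concatenate pooled, thresholded, max-pooled maps for both channels."""
        maps = [max_pool_2x8(normalized_thresholds(sum_pool_8x8(c)))
                for c in (ball_mask, cylinder_mask)]
        return np.concatenate([m.ravel() for m in maps])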
The five behaviors we trained were push–pull, hide, ball following, alternate, and cylinder navigation. In push–pull, the TurtleBot is trained to navigate toward the ball when it is far and to back away from it when it is near. The hide behavior has the TurtleBot back away from the ball when it is near and turn away from it when it is far. In ball following, the TurtleBot is trained to navigate to the ball. In the alternate task, the TurtleBot is trained to go back and forth between the cylinder and the ball. Finally, cylinder navigation involves the agent navigating to the cylinder. We further classify the training methods for these behaviors as flat, involving the push–pull, hide, and ball-following behaviors, and compositional, involving the alternate and cylinder-navigation behaviors.

In all cases, our human trainer (one of the co-authors) used differential feedback and diminishing returns to quickly reinforce behaviors and restrict focus to the areas needing tuning. However, in alternate and cylinder navigation, they attempted more advanced compositional training methods. For alternate, the agent was first trained to navigate to the ball when it sees it, and then to turn away when it is near. Then, the same was done independently for the cylinder. After training, introducing both objects would cause the agent to move back and forth between them. For cylinder navigation, they attempted to make use of an animal-training method called lure training, in which an animal is first conditioned to follow a lure object that is then used to guide it through more complex behaviors. In cylinder navigation, they first trained the ball to be a lure, used it to guide the TurtleBot to the cylinder, and finally gave a +4 reward to reinforce the behaviors it took when following the ball (turning to face the cylinder, moving toward it, and stopping upon reaching it). The agent would then navigate to the cylinder without requiring the ball to be present.

For COACH parameters, we used a softmax-parameterized policy, where each action preference value was a linear function of the image features plus tanh(θ_a), where θ_a is a learnable parameter for action a, providing a preference in the absence of any stimulus. We used two eligibility traces, with λ = 0.95 for feedback +1 and −1, and λ = 0.9999 for feedback +4. The feedback-action delay d was set to 6, which is 0.198 seconds. Additionally, we used an actor-critic parameter-update-rule variant in which action preference values are directly modified (along their gradient), rather than by the gradient of the policy [24]. This variant more rapidly communicates stimulus–response preferences. For TAMER, we used typical parameter values for fast-decision-cycle problems: delay-weighted aggregate TAMER with uniform-distribution credit assignment over 0.2 to 0.8 seconds, ε_p = 0, and c_min = 1 [10]. (See prior work for parameter meanings.) For TAMER's reward-function approximation, we used the same parameters as for the action preferences in COACH.
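A minimal sketch of the softmax parameterization described above is given below; the feature dimensionality and zero initialization are assumptions for illustration, not reported settings.

    import numpy as np

    class SoftmaxPreferencePolicy:
        """Action preference = linear function of features + tanh(theta_a)."""
        def __init__(self, n_features, n_actions):
            self.W = np.zeros((n_actions, n_features))   # linear preference weights
            self.theta_a = np.zeros(n_actions)           # stimulus-free bias params

        def preferences(self, x):
            return self.W @ x + np.tanh(self.theta_a)

        def probabilities(self, x):
            prefs = self.preferences(x)
            p = np.exp(prefs - prefs.max())
            return p / p.sum()

        def sample(self, x):
            p = self.probabilities(x)
            return np.random.choice(len(p), p=p)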
A. Results and Discussion

COACH was able to successfully learn all five behaviors, and a video showing its learning is available online at https://vid.me/3h2s. Each of these behaviors was trained in less than two minutes, including the time spent verifying that a behavior worked. Differential feedback and diminishing returns allowed only the behaviors in need of tuning to be quickly reinforced or extinguished, without any explicit division between training and testing. Moreover, the agent successfully benefited from the compositional training methods, correctly combining subbehaviors for alternate and quickly learning cylinder navigation with the lure.

TAMER successfully learned the behaviors only with the flat training methodology and failed to learn the compositionally trained behaviors. In all cases, TAMER tended to forget behavior, requiring feedback for previous decisions it had learned to be resupplied after it learned a new decision. For the alternate behavior, this forgetting led to failure: after training the behavior for the cylinder, the agent forgot some of the ball-related behavior and ended up drifting off course when it was time to go to the ball. TAMER also failed to learn from lure training, which was expected since TAMER does not allow reinforcing a long history of behaviors.

We believe TAMER's forgetting is a result of interpreting feedback as reward-function exemplars, in which new feedback in similar contexts can change the target. To illustrate this problem, we constructed a well-defined scenario in which TAMER consistently unlearns behavior. In this scenario, the goal was for the TurtleBot to always stay whenever the ball was present and to move forward if just the cylinder was present. We first trained TAMER to stay when the ball alone was present, using many rapid rewards (yielding a large aggregated signal). Next, we trained it to move forward when the cylinder alone was present. We then introduced both objects, and the TurtleBot correctly stayed. After we rewarded it for staying with a single reward (weaker than the previously used many rapid rewards), the TurtleBot moved forward. This counter-intuitive unlearning is a consequence of the small reward decreasing its reward-function target for the stay action to a point lower than the value for moving forward. COACH does not exhibit this problem: any reward for staying will strengthen the behavior.

VIII. CONCLUSION

In this work, we presented empirical results showing that the numeric feedback people give agents in an interactive training paradigm is influenced by the agent's current policy, and we argued why such policy-dependent feedback enables useful training strategies. We then introduced COACH, an algorithm that, unlike existing human-centered reinforcement-learning algorithms, converges to a local optimum when trained with policy-dependent feedback. Finally, we showed that COACH scales up in the context of a robotics case study in which a TurtleBot was successfully taught multiple behaviors with advanced training methods.

There are a number of exciting future directions for extending this work. In particular, because COACH is built on the actor-critic paradigm, it should be possible to combine it straightforwardly with learning from demonstration and environmental rewards, allowing an agent to be trained in a variety of ways. Second, because people give policy-dependent feedback, greater gains may be possible by investigating how people model the current policy of the agent and how their model differs from the agent's actual policy.

REFERENCES

[1] W. B. Knox and P. Stone, "Interactively shaping agents via human reinforcement: The TAMER framework," in Proceedings of the Fifth International Conference on Knowledge Capture. ACM, 2009, pp. 9–16.
[2] A. L. Thomaz and C. Breazeal, "Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance," in AAAI, vol. 6, 2006, pp. 1000–1005.
[3] A. Thomaz and C. Breazeal, "Robot learning via socially guided exploration," in ICDL Development on Learning, 2007.
[4] A. L. Thomaz and C. Breazeal, "Teachable robots: Understanding human teaching behavior to build more effective robot learners," 2008.
[5] S. Griffith, K. Subramanian, J. Scholz, C. Isbell, and A. L. Thomaz, "Policy shaping: Integrating human feedback with reinforcement learning," in Advances in Neural Information Processing Systems, 2013, pp. 2625–2633.
[6] R. Loftin, B. Peng, J. MacGlashan, M. L. Littman, M. E. Taylor, J. Huang, and D. L. Roberts, "Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning," Autonomous Agents and Multi-Agent Systems, vol. 30, no. 1, pp. 30–59, 2015.
[7] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor–critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.
[8] P. M. Pilarski, M. R. Dawson, T. Degris, F. Fahimi, J. P. Carey, and R. S. Sutton, "Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning," in 2011 IEEE International Conference on Rehabilitation Robotics. IEEE, 2011, pp. 1–7.
[9] C. Isbell, C. R. Shelton, M. Kearns, S. Singh, and P. Stone, "A social reinforcement learning agent," in Proceedings of the Fifth International Conference on Autonomous Agents. ACM, 2001, pp. 377–384.
[10] W. B. Knox, "Learning from human-generated reward," Ph.D. dissertation, University of Texas at Austin, 2012.
[11] M. K. Ho, M. L. Littman, F. Cushman, and J. L. Austerweil, "Teaching with rewards and punishments: Reinforcement or communication?" in Proceedings of the 37th Annual Meeting of the Cognitive Science Society, 2015.
[12] W. B. Knox, B. D. Glass, B. C. Love, W. T. Maddox, and P. Stone, "How humans teach agents," International Journal of Social Robotics, vol. 4, no. 4, pp. 409–421, 2012.
[13] R. G. Miltenberger, Behavior Modification: Principles and Procedures. Cengage Learning, 2011.
[14] L. Baird et al., "Residual algorithms: Reinforcement learning with function approximation," in Proceedings of the Twelfth International Conference on Machine Learning, 1995, pp. 30–37.
[15] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., "Policy gradient methods for reinforcement learning with function approximation," in NIPS, vol. 99, 1999, pp. 1057–1063.
[16] A. Barto, R. Sutton, and C. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834–846, 1983.
[17] W. B. Knox and P. Stone, "Learning non-myopically from human-generated reward," in Proceedings of the 2013 International Conference on Intelligent User Interfaces. ACM, 2013, pp. 191–202.
[18] B. Knox and P. Stone, "Combining manual feedback with subsequent MDP reward signals for reinforcement learning," in Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 2010.
[19] A. C. Tenorio-Gonzalez, E. F. Morales, and L. Villaseñor-Pineda, "Dynamic reward shaping: training a robot by voice," in Advances in Artificial Intelligence–IBERAMIA 2010. Springer, 2010, pp. 483–492.
[20] J. A. Clouse and P. E. Utgoff, "A teaching method for reinforcement learning," in Proceedings of the Ninth International Conference on Machine Learning (ICML 1992), 1992, pp. 92–101.
[21] R. Maclin, J. Shavlik, L. Torrey, T. Walker, and E. Wild, "Giving advice about preferred actions to reinforcement learners via knowledge-based kernel regression," in Proceedings of the National Conference on Artificial Intelligence, vol. 20, no. 2, 2005, p. 819.
[22] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[23] W. B. Knox, P. Stone, and C. Breazeal, "Training a robot via human feedback: A case study," in Social Robotics. Springer, 2013, pp. 460–470.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
