Reinforcement Learning via Recurrent Convolutional Neural Networks

Tanmay Shankar (Department of Mech. Engg., IIT Guwahati), Santosha K. Dwivedy (Department of Mech. Engg., IIT Guwahati), Prithwijit Guha (Department of EEE, IIT Guwahati)
[email protected], [email protected], [email protected]

Abstract—Deep Reinforcement Learning has enabled the learning of policies for complex tasks in partially observable environments, without explicitly learning the underlying model of the tasks. While such model-free methods achieve considerable performance, they often ignore the structure of the task. We present a natural representation of Reinforcement Learning (RL) problems using Recurrent Convolutional Neural Networks (RCNNs), to better exploit this inherent structure. We define 3 such RCNNs, whose forward passes execute an efficient Value Iteration, propagate beliefs of state in partially observable environments, and choose optimal actions, respectively. Backpropagating gradients through these RCNNs allows the system to explicitly learn the Transition Model and Reward Function associated with the underlying MDP, serving as an elegant alternative to classical model-based RL. We evaluate the proposed algorithms in simulation, considering a robot planning problem. We demonstrate the capability of our framework to reduce the cost of re-planning, learn accurate MDP models, and finally re-plan with learnt models to achieve near-optimal policies.

I. INTRODUCTION

Deep Reinforcement Learning (DRL) algorithms exploit model-free Reinforcement Learning (RL) techniques to achieve high levels of performance on a variety of tasks, often on par with human experts in the same domain [1]. These DRL methods use deep networks either to approximate the action-value function, as in Deep Q Networks [1]–[3], or to directly parametrize the policy, as in policy gradient methods [4].
While DRL methods adopt model-free approaches in order to generalize performance across various tasks, it is difficult to intuitively understand the reasoning of DRL approaches in making a particular choice of action, since they often ignore the underlying structure of the tasks.

In contrast, model-based methods [5]–[8] exploit this inherent structure to make decisions, based on domain knowledge. The estimates of the transition model and reward function associated with the underlying Markov Decision Process (MDP) are transferable across environments and agents [9], and provide insight into the system's choice of actions. A significant deterrent from model-based RL is the indirect nature of learning with the estimated model; subsequent planning is required to obtain the optimal policy [10].

To jointly exploit this inherent structure and overcome this indirect nature, we present a novel fusion of Reinforcement and Deep Learning, by representing classical solutions to RL problems within the framework of Deep Learning architectures; in particular, Recurrent Convolutional Neural Networks (RCNNs). By explicitly connecting the steps of solving MDPs with architectural elements of RCNNs, our representation inherits the properties of such networks, allowing the use of backpropagation on the defined RCNNs as an elegant solution to model-based RL problems. The representation also exploits the inherent structure of the MDP to reduce the cost of replanning, further incentivizing model-based approaches.

The contributions of this paper are hence three-fold. First, we define a Value Iteration RCNN (VI RCNN), whose forward passes carry out Value Iteration to efficiently obtain the optimal policy. Second, we define a Belief Propagation RCNN (BP RCNN) to update beliefs of state via the Bayes Filter; backpropagation through this network learns the underlying transition model of the agent in partially observable environments. Finally, we define a QMDP RCNN by combining the VI RCNN and the BP RCNN. The QMDP RCNN computes optimal choices of actions for a given belief of state, and backpropagation through it learns the reward function of an expert agent, in a Learning from Demonstration via Inverse Reinforcement Learning (LfD-IRL) setting [11]. The learnt reward function and transition model may then be used to re-plan via the VI RCNN, and the QMDP RCNN may make optimal action choices for new beliefs and the learnt Q-values.

We note that [12] follows an approach mathematically similar to the VI RCNN; however, it differs in being a model-free approach. The model-based approach followed here learns MDP models transferable across environments and agents. Further, the QMDP RCNN approach proposed here follows more naturally than the fully connected layer used in [12]. The gradient updates of the QMDP RCNN adopt an intuitive form, which further contributes to an intuitive understanding of the action choices of the QMDP RCNN, as compared to [12].

We evaluate each of the proposed RCNNs in simulation, where the given task is a 2-D robot planning problem. We demonstrate the capacity of the VI RCNN to reduce the cost of re-planning by significantly reducing the execution time as compared to classical Value Iteration. We evaluate the BP RCNN based on the accuracy of the learnt transition models against ground truth models, and show that it appreciably outperforms naive model-based methods for partially observable environments. Our experiments finally demonstrate that the QMDP RCNN is able to generate policies via re-planning with learnt MDP models that accurately represent policies generated from ground truth MDP models.

1 Please find our code and supplementary material at https://github.com/tanmayshankar/RCNN_MDP.

Fig. 1. Schematic representation of the Value Iteration Recurrent Convolutional Network. Notice the 4 stages present: convolution, fixed bias, max-pooling, and recurrence. The VI RCNN facilitates a natural representation of the Bellman update as a single RCNN layer.

II. THE VALUE ITERATION RCNN

In this section, we formulate Value Iteration as a recurrent convolution, and hence present the Value Iteration RCNN (VI RCNN) to carry out Value Iteration. We consider a standard Markov Decision Process, consisting of a 2-dimensional state space $S$ of size $N_d \times N_d$, state $s \in S$, action space $A$ of size $N_a$, actions $a \in A$, a transition model $T(s,a,s')$, a reward function $R(s,a)$, and a discount factor $\gamma$. Value Iteration, which is typically used in solving for optimal policies in an MDP, invokes the Bellman update equation as $V_{k+1}(s) = \max_a R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V_k(s')$.
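For reference, the classical Bellman update above can be sketched directly in NumPy over a flattened state space. This is our minimal illustration (the array layout and function names are assumptions, not the paper's released code), and it corresponds to the kind of standard implementation that the VI RCNN is later compared against in Section V.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, n_iters=100):
    """Classical Value Iteration over a flattened state space.

    T: (N_a, N_s, N_s) array, T[a, s, s_next] = p(s_next | s, a)
    R: (N_s, N_a) array of rewards R(s, a)
    Returns the value function V and the greedy policy.
    """
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Q[s, a] = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * np.einsum('asp,p->sa', T, V)
        V = Q.max(axis=1)            # Bellman backup: max over actions
    return V, Q.argmax(axis=1)
```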
Value Iteration as a Convolution: In our 2-D state space $S$, states $s$ and $s'$ are defined by 2 pairs of indices, $(i,j)$ and $(p,q)$. The summation over all states $s'$ may hence be equivalently represented as $\sum_{s'} = \sum_{p=1}^{N_d}\sum_{q=1}^{N_d}$. We may express $V_{k+1}(s)$ as $\max_a R(s,a) + \gamma \sum_{p=1}^{N_d}\sum_{q=1}^{N_d} T(s,a,s')_{p,q}\, V_k(s')_{p,q}$.

Actions correspond to transitions in the agent's state, dictated by the internal dynamics of the agent, or, in the case of physical robots, restricted by the robot's configuration and actuator capacities. It is impossible for the agent to immediately move from its current state to a far away state. This allows us to define a $w$ neighborhood $W(s')$ centred around the state $s'$, so the agent only has a finite probability of transitioning to other states within $W$. Mathematically, $T(s,a,s') \neq 0$ if $|i-p| \leq w$ and $|j-q| \leq w$, or if $s \in W(s')$, and $T(s,a,s') = 0$ for $s \notin W(s')$.

Further, the transition model is often invariant to the spatial location of the occurring transition. For example, a differential drive robot's dynamics are independent of the robot's position. We assume the transition model is stationary over the state space, incorporating appropriate state boundary conditions. We may now visualize this $w$ neighborhood restricted transition model itself to be centred around $s'$. By defining a flipped transition model as $\overline{T}(s,a,s')_{m,n} = T(s,a,s')_{N_t-m,\,N_t-n}$, and indexing it by $(u,v)$, we may express $V_{k+1}(s)$ as $\max_a R(s,a) + \gamma \sum_{u=-w}^{w}\sum_{v=-w}^{w} \overline{T}(s,a,s')_{u,v}\, V_k(s')_{i-u,j-v}$. We may represent this as a convolution:

$$V_{k+1}(s) = \max_a\; R(s,a) + \gamma\, T(s,a,s') * V_k(s') \qquad (1)$$

The Value Iteration RCNN: We define a Value Iteration RCNN (VI RCNN) to represent classical Value Iteration, based on the correlation between the Bellman update equation and the architectural elements of an RCNN. Equation (1) can be thought of as a single layer of a recurrent convolutional neural network consisting of the following 4 stages:

1) Convolutional Stage: The convolution $T(s,a,s') * V_k(s')$ represents the convolutional stage.
2) Max-Pooling Stage: The maximum at every state $s$ taken over all actions $a$, $\max_a$, is analogous to a max-pooling stage along the action 'channel'.
3) Fixed Bias Stage: The introduction of the reward term $R(s,a)$ is treated as the addition of a fixed bias.
4) Recurrence Stage: As $k$ is incremented in successive iterations, $V_{k+1}(s)$ is fed back as an 'input' to the network, introducing a recurrence relation into the network.

We may think of $V_k(s')$ as a single-channel $N_d \times N_d$ image. The transition model $T(s,a,s')$ then corresponds to a series of $N_a$ transition filters, each of size $N_t \times N_t$ ($N_t = 2w+1$), each to be convolved with the image. The values of the transition filters correspond directly to transition probabilities between states $s$ and $s'$ upon taking action $a$. Note that these convolutional transition filters naturally capture the spatial invariance of transitions by virtue of the lateral parameter sharing inherent to convolutional networks. Further, in this representation, each transition filter corresponds directly to a particular action. This unique one-to-one mapping between actions and filters proves to be very useful in learning MDP or RL models, as we demonstrate in Section III. Finally, the VI RCNN is completely differentiable, and can be trained by applying backpropagation through it.
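To make the four stages concrete, the following is a minimal sketch of a VI RCNN forward pass using SciPy's 2-D convolution, assuming one $N_t \times N_t$ filter per action and one reward map per action. The function and variable names are illustrative, not taken from the released code.

```python
import numpy as np
from scipy.signal import convolve2d

def vi_rcnn_forward(T_filters, R, gamma=0.95, n_iters=100):
    """Value Iteration as convolution + fixed bias + action max-pool + recurrence.

    T_filters: (N_a, N_t, N_t) transition filters (each sums to 1)
    R:         (N_a, N_d, N_d) reward 'bias' maps, one per action
    Returns the value image V and the greedy policy map.
    """
    n_actions, n_d = R.shape[0], R.shape[1]
    V = np.zeros((n_d, n_d))
    Q = np.zeros((n_actions, n_d, n_d))
    for _ in range(n_iters):
        # Convolutional stage: one feature map per action channel
        Q = np.stack([convolve2d(V, T_filters[a], mode='same', boundary='fill')
                      for a in range(n_actions)])
        # Fixed-bias stage (reward) and discounting
        Q = R + gamma * Q
        # Max-pooling stage along the action channel; recurrence feeds V back in
        V = Q.max(axis=0)
    return V, Q.argmax(axis=0)
```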
III. THE BELIEF PROPAGATION RCNN

In this section, we present the Belief Propagation RCNN (BP RCNN), to represent the Bayes filter belief update within the architecture of an RCNN, and hence learn MDP transition models. For an agent to make optimal choices of actions in partially observable environments, an accurate belief of state $b(s)$ is required. Upon taking an action $a$ and receiving an observation $z$, we may use the Bayes filter to propagate the belief $b(s')$ in terms of an observation model $O(s',a,z)$, associated with the probability $p(z|s',a)$. The discrete Bayes filter is depicted in (2). Note that the denominator is equivalent to $p(z|a,b)$, and can be implemented as a normalization factor $\eta$.

$$b(s') = \frac{O(s',a,z)\,\sum_{s \in S} T(s,a,s')\, b(s)}{\sum_{s' \in S} O(s',a,z)\,\sum_{s \in S} T(s,a,s')\, b(s)} \qquad (2)$$

Bayes Filter as a Convolution and an Element-wise Product: We may represent the motion update of the traditional Bayes filter as a convolution, analogous to the representation of the $\sum_{s'} T(s,a,s')\, V_k(s')$ step of Value Iteration as $T(s,a,s') * V_k(s')$ in equation (1). Transitions in a Bayes filter occur from $s$ to $s'$, in contrast to the 'backward' passes executed in Value Iteration, from $s'$ to $s$. This suggests our transition model is centred around $s$ rather than $s'$, and we may represent $\sum_s$ as $\sum_{i=1}^{N_d}\sum_{j=1}^{N_d}$. We write the intermediate belief $\overline{b}(s')_{p,q}$ as $\sum_{u=-w}^{w}\sum_{v=-w}^{w} T(s,a,s')_{u,v}\, b(s)_{p-u,q-v}$, which can be expressed as a convolution $\overline{b}(s') = T(s,a,s') * b(s)$.
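For comparison with the convolutional form developed in this section, equation (2) can be sketched in its generic dense-matrix form as follows. This is our illustration; the tensor layout is an assumption.

```python
import numpy as np

def bayes_filter_update(b, T, O_z, a):
    """Discrete Bayes filter of equation (2) over a flattened state space.

    b:   (N_s,) prior belief b(s)
    T:   (N_a, N_s, N_s) transition model, T[a, s, s_next] = p(s_next | s, a)
    O_z: (N_s,) observation likelihoods p(z | s') for the received observation z
    a:   index of the executed action
    """
    b_bar = T[a].T @ b             # motion update: sum_s T(s, a, s') b(s)
    b_new = O_z * b_bar            # correction update
    return b_new / b_new.sum()     # normalization factor eta
```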
For the Bayes Filter correction update, we consider a simple observation model $O(z,s')$ for observation $z$ in state $s'$. The agent observes its own state as $s^z_{x,y}$, indexed by $x$ and $y$, given that the actual state of the robot is $s'_{i,j}$, with a probability $p(s^z_{x,y}\,|\,s'_{i,j})$. Analogous to the $w$ neighborhood for the transition model, we introduce an $h$ neighborhood for the observation model, centred around the observed state $s^z_{x,y}$. Thus $O(s^z_{x,y}, s') \neq 0$ for states $s'_{i,j}$ within the $h$ neighborhood $H(s^z_{x,y})$, and $O(s^z_{x,y}, s') = 0$ for all $s'_{i,j} \notin H(s^z_{x,y})$.

The correction update in the Bayes filter can be represented as a masking operation on the intermediate belief $\overline{b}(s')$, or an element-wise (Hadamard) product ($\odot$). Since the observation $O(s^z)$ is also centred around $s^z$, we may express $b(s')_{i,j}$ as $\eta\, O(s^z)_{i-x,j-y}\, \overline{b}(s')_{i,j}\ \forall\ i,j$, which corresponds to a normalized element-wise product $b(s') = \eta\, O(s^z) \odot \overline{b}(s')$. Upon combining the motion update convolution and the correction update element-wise product, we may represent the Bayes Filter as a convolutional stage followed by an element-wise product. We have:

$$b(s') = \eta\, O(s^z) \odot \big(T(s,a,s') * b(s)\big) \qquad (3)$$

The Belief Propagation RCNN: We define a Belief Propagation RCNN (BP RCNN) to represent the Bayes Filter belief update, analogous to the representation of Value Iteration as the VI RCNN. A single layer of the BP RCNN represents equation (3), consisting of the following stages:

1) Convolution Stage: The convolution $T(s,a,s') * b(s)$ represents the convolution stage of the BP RCNN.
2) Element-wise Product Stage: The element-wise multiplication $O(s^z) \odot \overline{b}(s')$ can be considered as an element-wise product (or Hadamard product) layer in 2-D.
3) Recurrence Stage: The network output, $b(s')$, is fed as the input $b(s)$ at the next time step, forming an output recurrence network.

Fig. 2. Schematic representation of the Belief Propagation Recurrent Convolutional Network. Forward passes are represented by black arrows, while backpropagation is symbolized by blue arrows.

Forward passes of the BP RCNN propagate beliefs of state, given a choice of action $a_t$ and a received observation $z_t$. The belief of state at any given instant, $b(s)$, is treated as a single-channel $N_d \times N_d$ image, and is convolved with a single transition filter $T_{a_t}$, corresponding to the action executed, $a_t$. A key difference between the VI RCNN and the BP RCNN is that only a single transition filter is activated during the belief propagation, since the agent may only execute a single action at any given time instant. The BP RCNN is also completely differentiable.

Training and Loss: Learning the weights of the BP RCNN via backpropagation learns the transition model of the agent. Our objective is thus to learn a transition model $T'(s,a,s')$ such that the network output $b(s')$ is as close as possible to the target belief $\widehat{b}(s')$ at every time step. Formally, we minimize a loss function defined as the least square error between both beliefs over all states: $L_t = \sum_{i=1}^{N_d}\sum_{j=1}^{N_d} \big(\widehat{b}(s'_t) - b(s'_t)\big)^2_{i,j}$, with conditions on the transition model, $0 \leq T'(s,a,s')_{m,n} \leq 1$ and $\sum_{m=-w}^{w}\sum_{n=-w}^{w} T'(s,a,s')_{m,n} = 1\ \forall\ a \in A$, being incorporated using appropriate penalties.

The BP RCNN is trained in a typical RL setting, with an agent interacting with an environment by taking random action choices $(a_t, a_{t+1}, \ldots)$ and receiving corresponding observations $(z_t, z_{t+1}, \ldots)$. The target beliefs, $\widehat{b}(s')$, are generated as one-hot representations of the observed state. While training the BP RCNN, the randomly initialized transition model magnifies the initial uncertainty in the belief as it is propagated forward in time, leading to instability in learning the transition model. Teacher forcing [13] is adopted to handle this uncertainty; thus the target belief, rather than the network output, is propagated as the input for the next time step. Since the target belief is independent of the initial uncertainty of the transition model, the network is able to learn a meaningful set of filter values. Teacher forcing, or such target recurrence, also decouples consecutive time steps, reducing backpropagation through time to backpropagation over data points that are merely generated sequentially.
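A minimal sketch of a single BP RCNN layer, i.e. equation (3) with only the executed action's filter active, might look as follows. The names are illustrative, and the teacher-forcing comment reflects the training scheme described above rather than the authors' released implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def bp_rcnn_forward(b, T_filters, a_t, O_mask):
    """One BP RCNN layer: belief propagation as in equation (3).

    b:         (N_d, N_d) current belief image b(s)
    T_filters: (N_a, N_t, N_t) transition filters; only filter a_t is active
    a_t:       index of the executed action
    O_mask:    (N_d, N_d) observation likelihoods O(s^z) centred on the
               observed state (zero outside the h-neighborhood)
    """
    b_bar = convolve2d(b, T_filters[a_t], mode='same', boundary='fill')  # convolution stage
    b_new = O_mask * b_bar                                               # element-wise product stage
    return b_new / b_new.sum()                                           # normalization eta

# Under teacher forcing, the one-hot target belief (the observed state) is fed
# back as the next input instead of b_new, decoupling consecutive time steps.
```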
IV. THE QMDP RCNN

We finally present the QMDP RCNN as a combination of the VI RCNN and the BP RCNN, to retrieve optimal choices of actions, and to learn the underlying reward function of the POMDP (in addition to the transition model). Treating the outputs of the VI RCNN and the BP RCNN as the action-value function, $Q(s,a)$, and the belief of state, $b(s)$, respectively, at every time step, we may fuse these outputs to compute belief space Q-values $Q(b(s),a)$ using the QMDP approximation. We have $Q(b(s),a) = \sum_{s \in S} Q(s,a)\, b(s)$, which may alternately be represented as a Frobenius inner product $\langle b(s), Q(s,a)\rangle_F$. Optimal action choices $y^a_t$, corresponding to the highest belief space Q-values, can be retrieved from the softmax of $Q(b(s),a)$:

$$y^a_t = \frac{e^{Q(b(s),a)}}{\sum_a e^{Q(b(s),a)}} \qquad (4)$$

Fig. 3. Schematic representation of the QMDP RCNN, as a combination of the Value Iteration RCNN and the Belief Propagation RCNN. Notice the following stages: inner product, sum-pooling, and the softmax stage providing the output actions. The gradients propagated to the reward function may alternately be propagated to the Q-value estimate. Forward passes of the QMDP RCNN are used for planning in partially observable domains, and backpropagation learns the reward function of the MDP. The BP RCNN component can be trained in parallel to the QMDP RCNN.

The QMDP RCNN: In addition to the stages of the VI RCNN and BP RCNN, the QMDP RCNN is constructed from the following stages:

1) Frobenius Inner Product Stage: The QMDP approximation step $Q(b(s),a) = \langle b(s), Q(s,a)\rangle_F$ represents a Frobenius inner product stage. This can be implemented as a 'valid' convolutional layer with zero stride.
2) Softmax Stage: $y^a_t = \mathrm{softmax}\ Q(b(s),a)$ represents the softmax stage of the network.

The resultant QMDP RCNN, as depicted in Figure 3, combines planning and learning in a single network. Forward passes of the network provide action choices $y^a_t$ as an output, and backpropagating through the network learns the reward function, and updates the Q-values and optimal action choices via planning through the VI RCNN component.

Training and Loss: The QMDP RCNN is trained in a Learning from Demonstration - Inverse Reinforcement Learning setting. An expert agent demonstrates a set of tasks, recorded as a series of trajectories $\{(a_t,z_t),(a_{t+1},z_{t+1}),\ldots\}$, which serve as inputs to the network. The actions executed in the expert trajectories are represented in a one-hot encoding, serving as the network targets $\widehat{y}^a_t$. The objective is hence to learn a reward function such that the action choices output by the network match the target actions at every time step. The loss function chosen is the cross-entropy loss between the output actions and the target actions at a given time step, defined as $C_t = -\sum_a \widehat{y}^a_t \ln y^a_t$. The QMDP RCNN learns the reward function by backpropagating the gradients retrieved from this loss function through the network, to update the reward function. The updated reward estimate is then fed back to the VI RCNN component to update the Q-values, and hence the action choices of the network. Experience replay [14] is used to randomize over the expert trajectories and transitions while training the QMDP RCNN.

The closed form of the gradient for reward updates is of the form $R(s,a) \leftarrow R(s,a) - \alpha_t\, (y^a_t - \widehat{y}^a_t)\, b(s)$. The $(y^a_t - \widehat{y}^a_t)$ term thus dictates the extent to which actions are positively or negatively reinforced, while the belief term $b(s)$ acts in a manner similar to an attention mechanism, directing the reward function updates to specific regions of the state space where the agent is believed to be.

We emphasize that the QMDP RCNN differs from traditional DRL approaches in that the QMDP RCNN is not provided with samples of the reward function itself via its interaction with the environment; rather, it is provided with the action choices made by the expert. While the LfD-IRL approach employed for the QMDP RCNN is on-policy, the in-built planning (via the VI RCNN) causes reward updates to permeate the entire state space. The dependence of the LfD-IRL approach in the QMDP RCNN on expert-optimal actions for training differs from the BP RCNN, which uses arbitrary (and often suboptimal) choices of actions.
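The QMDP RCNN forward pass and the reward update described above can be sketched as follows. This is our reading of the stated equations, assuming the gradient step $R(s,a) \leftarrow R(s,a) - \alpha_t (y^a_t - \widehat{y}^a_t)\, b(s)$; it is not the authors' released implementation, and the names are illustrative.

```python
import numpy as np

def qmdp_forward(b, Q):
    """QMDP RCNN forward pass.

    b: (N_d, N_d) belief image; Q: (N_a, N_d, N_d) Q-value maps from the VI RCNN.
    Returns softmax action probabilities y, as in equation (4).
    """
    q_b = np.array([np.sum(b * Q[a]) for a in range(Q.shape[0])])  # Frobenius inner products
    e = np.exp(q_b - q_b.max())                                    # numerically stable softmax
    return e / e.sum()

def reward_gradient_step(R, y, y_target, b, alpha=0.1):
    """Cross-entropy gradient step on the reward estimate.

    R: (N_a, N_d, N_d) reward estimate; y, y_target: (N_a,) output and one-hot
    target action distributions; b: (N_d, N_d) belief. Implements
    R(s, a) := R(s, a) - alpha * (y_a - y_hat_a) * b(s).
    """
    return R - alpha * (y - y_target)[:, None, None] * b[None, :, :]
```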
V. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, experimental results are individually presented for each of the 3 RCNNs defined.

VI RCNN: Here, our contribution is a representation that enables a more efficient computation of Value Iteration. We thus present the per-iteration run-time of Value Iteration via the VI RCNN, versus a standard implementation of Value Iteration. Both algorithms are implemented in Python, run on an Intel Core i7 machine, 2.13 GHz, with 4 cores. The inherent parallelization of the convolution facilitates a speed-up of VI by several orders of magnitude, as depicted in Figure 4; at best, the VI RCNN completes a single iteration 5106.42 times faster, and at worst, 1704.43 times faster, than the regular implementation.

Fig. 4. Plot of the time per iteration of Value Iteration versus the size of the state space. The time taken by the VI RCNN is orders of magnitude less than regular Value Iteration, for all transition space sizes.

BP RCNN: The primary objective while training the BP RCNN is to determine an accurate estimate of the transition model. We evaluate performance using the least square error between the transition model estimate and the ground truth model, defined as $C' = \sum_a\sum_u\sum_v \big(T^a_{u,v} - T'^a_{u,v}\big)^2$. Since the final objective is to use these learnt models to generate new policies by replanning via the VI RCNN, we also present the replanning accuracy of each model, defined as the percentage of actions chosen correctly by the network's policy, compared to the original policy. A comparison is provided against the models learnt using a counting-style algorithm analogous to that used in [15], as well as a weighted-counting-style algorithm that updates model estimates by counting over belief values. We experiment with the performance of these algorithms under both fully and partially observable settings, and with whether teacher forcing is used (via target recurrence) or not (as in output recurrence). Different methods of adapting the learning rate are also explored, including RMSProp, linear decay, and maintaining individual learning rates for each transition filter, or filter-wise decay, as presented in Table I.

TABLE I
TRANSITION MODEL ERROR AND REPLANNING POLICY ACCURACY FOR THE BP RCNN. Values in bold in the original denote best-in-class performance. Parameter values marked * are maintained during variation of alternate parameters, and are used during the final experiments.

Parameter type: value                          | BP RCNN (Trans. error, Replan. acc.) | Weighted Counting (Trans. error, Replan. acc.) | Naive Counting (Trans. error, Replan. acc.)
Observability: Fully Observable                | 0.0210, 96.882%  | 0.0249, 96.564%  | 0.0249, 96.564%
Observability: Partially Observable*           | 0.1133, 95.620%  | 1.0637, 75.571%  | 0.1840, 83.590%
Teacher Forcing: Output Recurrence             | 3.5614, 21.113%  | 3.0831, 27.239%  | (no use of recurrence)
Teacher Forcing: Target Recurrence*            | 0.1133, 95.620%  | 1.0637, 75.571%  | (no use of recurrence)
Learning Rate Adaptation: RMSProp              | 2.8811, 40.315%  | 2.5703, 44.092%  | (no use of learning rate)
Learning Rate Adaptation: Linear Decay         | 0.2418, 93.596%  | 1.3236, 52.451%  | (no use of learning rate)
Learning Rate Adaptation: Filter-wise Decay*   | 0.1133, 95.620%  | 1.0637, 75.571%  | (no use of learning rate)

The BP RCNN's loss function is defined in terms of the non-stationary beliefs at any given time instant; an online training approach is hence adopted rather than batch mode. We observe that filter-wise decay outperforms linear decay when different actions are chosen with varying frequencies. RMSProp suffers from the oscillations in the loss function that arise due to this dynamic nature, and hence performs poorly. The use of teacher forcing mitigates this dynamic nature over time, increasing the replanning accuracy from 21.11% to 95.62%. The 95.62% replanning accuracy attained under partial observability approaches the 96.88% accuracy of the algorithm under fully observable settings.
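For clarity, the two quantities reported in Table I could be computed along the following lines; this is a hedged sketch, and the exact evaluation code in the released repository may differ.

```python
import numpy as np

def transition_error(T_true, T_learnt):
    """Least-square transition error C' = sum_{a,u,v} (T^a_{u,v} - T'^a_{u,v})^2."""
    return float(np.sum((T_true - T_learnt) ** 2))

def replanning_accuracy(policy_true, policy_replanned):
    """Fraction of states where the replanned policy picks the original action."""
    return float(np.mean(policy_true == policy_replanned))
```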
QMDP RCNN: For the QMDP RCNN, the objective is to learn a reward function that results in a policy and level of performance similar to those obtained with the original reward. Since learning a reward such that the demonstrated trajectories are optimal does not have a unique solution, quantifying the accuracy of the reward estimate is meaningless. Rather, we present the replanning accuracy (as used for the BP RCNN) for both learnt and known transition models and rewards. We also run policy evaluation on the generated policy using the original rewards, and present the increase in expected reward for the learnt models, defined as $\frac{1}{N_d^2}\sum_s \frac{V(s)_{\mathrm{learnt}} - V(s)_{\mathrm{orig}}}{V(s)_{\mathrm{orig}}}$. The results are presented in Table II.

TABLE II
REPLANNING ACCURACY AND INCREASE IN EXPECTED REWARD FOR INVERSE REINFORCEMENT LEARNING FROM DEMONSTRATIONS VIA THE QMDP RCNN. Values in bold in the original denote best-in-class performance. Parameter values marked * are maintained during variation of alternate parameters, and are used during the final experiments.

QMDP RCNN columns: Learning Rate (RMSProp | Linear Decay*), Experience Replay (None | Used*), Feedback Recurrence (Immediate | Delayed*); followed by Weighted Counting and Naive Counting.

Learnt Reward, Learnt Transition - Replanning Accuracy:      45.113% | 65.120% | 62.211% | 65.120% | 65.434% | 65.120% | 45.102% | 50.192%
Learnt Reward, Learnt Transition - Expected Reward Increase: -75.030% | -10.688% | -16.358% | -10.688% | -16.063% | -10.688% | -74.030% | -43.166%
Learnt Reward, Known Transition - Replanning Accuracy:       43.932% | 63.476% | 60.852% | 63.476% | 63.900% | 63.476% | 49.192% | 53.427%
Learnt Reward, Known Transition - Expected Reward Increase:  -79.601% | -11.429% | -16.106% | -11.429% | -17.949% | -11.429% | -64.294% | -31.812%
Known Reward, Learnt Transition - Replanning Accuracy:       95.620% (known reward used for replanning) | 75.571% | 83.590%
Known Reward, Learnt Transition - Expected Reward Increase:  +0.709% (known reward used for replanning) | -0.617% | -0.806%

As is the case for the BP RCNN, RMSProp (and adaptive learning rates in general) counter the magnitude of the reward updates dictated by the QMDP RCNN, and hence adversely affect performance. Experience replay marginally increases the replanning accuracy, but has a significant effect on the increase in expected reward. Similarly, using delayed feedback (after passes through an entire trajectory) also boosts the increase in expected reward. On learning both rewards and transition models, the QMDP RCNN achieves an appreciable 65.120% replanning accuracy and a minimal −10.688% change in expected reward. We emphasize that, given access solely to observations and action choices of the agent, and without assuming any feature engineering of the reward, the QMDP RCNN is able to achieve near-optimal levels of expected reward. Utilizing the ground truth transition model with the learnt reward, the QMDP RCNN performs marginally worse, with a 63.476% accuracy and a −11.429% change in expected reward.
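The expected-reward-increase metric of Table II, given value functions obtained by policy evaluation under the original reward, could be computed as in the sketch below (assuming $V(s)_{\mathrm{orig}}$ is non-zero everywhere; names are illustrative).

```python
import numpy as np

def expected_reward_increase(V_learnt, V_orig):
    """(1 / N_d^2) * sum_s (V_learnt(s) - V_orig(s)) / V_orig(s), as a percentage.

    V_learnt, V_orig: (N_d, N_d) value images from policy evaluation under the
    original reward; V_orig is assumed non-zero everywhere.
    """
    ratio = (V_learnt - V_orig) / V_orig
    return 100.0 * ratio.mean()
```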
The true efficacy of the BP RCNN and the QMDP RCNN lies in their ability to learn accurate transition models and reward functions in partially observable settings, where they outperform the existing model-based approaches of naive and weighted counting by significant margins, in terms of replanning accuracy, transition error, and expected reward. Finally, we note that while all 3 defined RCNN architectures are demonstrated for 2-D cases, these architectures can be extended to any number of dimensions, and any number of actions, with suitable modifications to the convolution operation in higher dimensions.

VI. CONCLUSION

In this paper, we defined 3 RCNN-like architectures, namely the Value Iteration RCNN, the Belief Propagation RCNN, and the QMDP RCNN, to facilitate a more natural representation of solutions to model-based Reinforcement Learning. Together, these contributions speed up the planning process in a partially observable environment, reducing the cost of replanning for model-based approaches. Given access to agent observations and action choices over time, the BP RCNN learns the transition model, and the QMDP RCNN learns the reward function, and subsequently replans with the learnt models to make near-optimal action choices. The proposed architectures were also found to outperform existing model-based approaches in speed and model accuracy. The natural symbiotic representation of planning and learning algorithms allows these approaches to be extended to more complex tasks, by integrating them with sophisticated perception modules.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," in NIPS Deep Learning Workshop, 2013.
[2] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," CoRR, vol. abs/1509.06461, 2015. [Online]. Available: http://arxiv.org/abs/1509.06461
[3] M. J. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," CoRR, vol. abs/1507.06527, 2015. [Online]. Available: http://arxiv.org/abs/1507.06527
[4] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems 12. MIT Press, 2000, pp. 1057–1063.
[5] C. G. Atkeson and J. C. Santamaría, "A comparison of direct and model-based reinforcement learning," in Proceedings of the 1997 IEEE International Conference on Robotics and Automation, vol. 4. IEEE Press, 1997, pp. 3557–3564. [Online]. Available: ftp://ftp.cc.gatech.edu/pub/people/cga/rl-compare.ps.gz
[6] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," I. J. Robotic Res., vol. 32, no. 11, pp. 1238–1274, 2013. [Online]. Available: http://dx.doi.org/10.1177/0278364913495721
[7] B. Bakker, V. Zhumatiy, G. Gruener, and J. Schmidhuber, "Quasi-online reinforcement learning for robots," in Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA), 2006.
[8] T. Hester, M. Quinlan, and P. Stone, "Generalized model learning for reinforcement learning on a humanoid robot," in ICRA. IEEE, 2010, pp. 2369–2374. [Online]. Available: http://dblp.uni-trier.de/db/conf/icra/icra2010.html#HesterQS10
[9] J. Choi and K.-E. Kim, "Inverse reinforcement learning in partially observable environments," J. Mach. Learn. Res., vol. 12, pp. 691–730, Jul. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953048.2021028
[10] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
[11] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proceedings of the Twenty-first International Conference on Machine Learning. ACM Press, 2004.
[12] A. Tamar, S. Levine, and P. Abbeel, "Value iteration networks," CoRR, vol. abs/1602.02867, 2016. [Online]. Available: http://arxiv.org/abs/1602.02867
[13] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning," 2016, book in preparation for MIT Press. [Online]. Available: http://www.deeplearningbook.org
[14] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, no. 3–4, pp. 293–321, 1992. [Online]. Available: http://www.cs.ualberta.ca/~sutton/lin-92.pdf
[15] M. Kearns and S. Singh, "Near-optimal reinforcement learning in polynomial time," Mach. Learn., vol. 49, no. 2–3, pp. 209–232, Nov. 2002. [Online]. Available: http://dx.doi.org/10.1023/A:1017984413808
[16] B. C. Stadie, S. Levine, and P. Abbeel, "Incentivizing exploration in reinforcement learning with deep predictive models," CoRR, vol. abs/1507.00814, 2015. [Online]. Available: http://arxiv.org/abs/1507.00814
[17] B. Bakker, "Reinforcement learning with long short-term memory," in NIPS. MIT Press, 2002, pp. 1475–1482.
[18] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. P. Singh, "Action-conditional video prediction using deep networks in Atari games," CoRR, vol. abs/1507.08750, 2015. [Online]. Available: http://arxiv.org/abs/1507.08750
[19] C. G. Atkeson, "Nonparametric model-based reinforcement learning," in Advances in Neural Information Processing Systems 10, 1997, pp. 1008–1014. [Online]. Available: http://papers.nips.cc/paper/1476-nonparametric-model-based-reinforcement-learning
[20] S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua, "Long-term planning by short-term prediction," CoRR, vol. abs/1602.01580, 2016. [Online]. Available: http://arxiv.org/abs/1602.01580