Near Optimal Behavior via Approximate State Abstraction

David Abel†, Brown University, david [email protected]
D. Ellis Hershkowitz†, Carnegie Mellon University, [email protected]
Michael L. Littman, Brown University, [email protected]

Abstract

The combinatorial explosion that plagues planning and reinforcement learning (RL) algorithms can be moderated using state abstraction. Prohibitively large task representations can be condensed such that essential information is preserved, and consequently, solutions are tractably computable. However, exact abstractions, which treat only fully-identical situations as equivalent, fail to present opportunities for abstraction in environments where no two situations are exactly alike. In this work, we investigate approximate state abstractions, which treat nearly-identical situations as equivalent. We present theoretical guarantees of the quality of behaviors derived from four types of approximate abstractions. Additionally, we empirically demonstrate that approximate abstractions lead to reduction in task complexity and bounded loss of optimality of behavior in a variety of environments.

1 Introduction

Abstraction plays a fundamental role in learning. Through abstraction, intelligent agents may reason about only the salient features of their environment while ignoring what is irrelevant. Consequently, agents are able to solve considerably more complex problems than they would be able to without the use of abstraction. However, exact abstractions, which treat only fully-identical situations as equivalent, require complete knowledge that is computationally intractable to obtain. Furthermore, often no two situations are identical, so exact abstractions are often ineffective. To overcome these issues, we investigate approximate abstractions that enable agents to treat sufficiently similar situations as identical. This work characterizes the impact of equating "sufficiently similar" states in the context of planning and RL in Markov Decision Processes (MDPs). The remainder of our introduction contextualizes these intuitions in MDPs.

[Figure 1 diagram: the true problem M_G (with Start and Goal), the abstraction φ(M_G) yielding the abstracted problem M_A, the solution to the abstracted problem M_A, and the bounded RL error incurred by applying that solution back in the true problem.]

Figure 1: We investigate families of approximate state abstraction functions that induce abstract MDPs whose optimal policies have bounded value in the original MDP.

A previous version of this paper was published in the Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).
†The first two authors contributed equally.

Solving for optimal behavior in MDPs in a planning setting is known to be P-complete in the size of the state space [28, 25]. Similarly, many RL algorithms for solving MDPs are known to require a number of samples polynomial in the size of the state space [31]. Although polynomial runtime or sample complexity may seem like a reasonable constraint, the size of the state space of an MDP grows super-polynomially with the number of variables that characterize the domain, a result of Bellman's curse of dimensionality. Thus, solutions polynomial in state space size are often ineffective for sufficiently complex tasks. For instance, a robot involved in a pick-and-place task might be able to employ planning algorithms to solve for how to manipulate some objects into a desired configuration in time polynomial in the number of states, but the number of states it must consider grows exponentially with the number of objects with which it is working [1].
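To make this combinatorial explosion concrete, the short sketch below (with illustrative, made-up numbers rather than figures from [1]) counts joint configurations when each of n objects can occupy any of k discrete poses; the state space size k^n quickly outgrows anything polynomial in the number of objects.

```python
# Illustrative only: if each of n objects can occupy any of k discrete poses,
# the joint state space contains k**n configurations.
def joint_state_count(num_objects: int, poses_per_object: int = 10) -> int:
    return poses_per_object ** num_objects

for n in range(1, 7):
    print(f"{n} objects -> {joint_state_count(n):,} states")
# 1 object -> 10 states, ..., 6 objects -> 1,000,000 states
```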
Thus, a key research agenda for planning and RL is leveraging abstraction to reduce large state spaces [2, 21, 10, 12, 6]. This agenda has given rise to methods that reduce ground MDPs with large state spaces to abstract MDPs with smaller state spaces by aggregating states according to some notion of equality or similarity. In the context of MDPs, we understand exact abstractions as those that aggregate states with equal values of particular quantities, for example, optimal Q-values. Existing work has characterized how exact abstractions can fully maintain optimality in MDPs [24, 8].

The thesis of this work is that performing approximate abstraction in MDPs by relaxing the state aggregation criteria from equality to similarity achieves polynomially bounded error in the resulting behavior while offering three benefits. First, approximate abstractions employ the sort of knowledge that we expect a planning or learning algorithm to compute without fully solving the MDP. In contrast, exact abstractions often require solving for optimal behavior, thereby defeating the purpose of abstraction. Second, because of their relaxed criteria, approximate abstractions can achieve greater degrees of compression than exact abstractions. This difference is particularly important in environments where no two states are identical. Third, because the state aggregation criteria are relaxed to near equality, approximate abstractions are able to tune the aggressiveness of abstraction by adjusting what they consider sufficiently similar states.

We support this thesis by describing four different types of approximate abstraction functions that preserve near-optimal behavior by aggregating states on different criteria: φ̃_{Q*,ε}, on similar optimal Q-values; φ̃_{model,ε}, on similarity of rewards and transitions; φ̃_{bolt,ε}, on similarity of a Boltzmann distribution over optimal Q-values; and φ̃_{mult,ε}, on similarity of a multinomial distribution over optimal Q-values. Furthermore, we empirically demonstrate the relationship between the degree of compression and error incurred on a variety of MDPs.

This paper is organized as follows. In the next section, we introduce the necessary terminology and background of MDPs and state abstraction. Section 3 surveys existing work on state abstraction applied to sequential decision making. Section 4 establishes our abstraction notation. Section 5 introduces our primary result: bounds on the error guaranteed by four classes of approximate state abstraction. The following two sections introduce simulated domains used in experiments (Section 6) and a discussion of experiments in which we apply one class of approximate abstraction to a variety of different tasks to empirically illustrate the relationship between degree of compression and error incurred (Section 7).

2 MDPs and Sequential Decision Making

An MDP is a problem representation for sequential decision making agents, represented by a five-tuple: ⟨S, A, T, R, γ⟩. Here, S is a finite state space; A is a finite set of actions available to the agent; T denotes T(s,a,s'), the probability of an agent transitioning to state s' ∈ S after applying action a ∈ A in state s ∈ S; R(s,a) denotes the reward received by the agent for executing action a in state s; and γ ∈ [0,1] is a discount factor that determines how much the agent prefers future rewards over immediate rewards. We assume without loss of generality that the range of all reward functions is normalized to [0,1]. The solution to an MDP is called a policy, denoted π : S → A.
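For concreteness, a minimal sketch of this five-tuple as a Python container follows; the field names and the tiny two-state example are our own illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    """Finite MDP <S, A, T, R, gamma> with rewards normalized to [0, 1]."""
    states: List[str]
    actions: List[str]
    # T[(s, a)] maps each next state s' to the probability T(s, a, s').
    T: Dict[Tuple[str, str], Dict[str, float]]
    # R[(s, a)] is the reward for executing a in s.
    R: Dict[Tuple[str, str], float]
    gamma: float

# A tiny two-state example with made-up dynamics and rewards.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    T={("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
       ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}},
    R={("s0", "stay"): 0.0, ("s0", "go"): 0.1,
       ("s1", "stay"): 1.0, ("s1", "go"): 0.0},
    gamma=0.95,
)
```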
The objective of an agent is to solve for the policy that maximizes its expected discounted reward from any state, denoted π*. We denote the expected discounted reward for following policy π from state s as the value of the state under that policy, V^π(s). We similarly denote the expected discounted reward for taking action a ∈ A and then following policy π from state s forever after as Q^π(s,a), defined by the Bellman Equation as:

    Q^π(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') Q^π(s', π(s')).    (1)

We let RMax denote the maximum reward (which is 1) and QMax denote the maximum Q-value, which is RMax/(1−γ). The value function under a given policy, denoted V^π(s), is defined as:

    V^π(s) = Q^π(s, π(s)).    (2)

Lastly, we denote the value and Q functions under the optimal policy as V* or V^{π*} and Q* or Q^{π*}, respectively. For further background, see Kaelbling et al. [22].

3 Related Work

Several other projects have addressed similar topics.

3.1 Approximate State Abstraction

Dean et al. [9] leverage the notion of bisimulation to investigate partitioning an MDP's state space into clusters of states whose transition model and reward function are within ε of each other. They develop an algorithm called Interval Value Iteration (IVI) that converges to the correct bounds on a family of abstract MDPs called Bounded MDPs.

Several approaches build on Dean et al. [9]. Ferns et al. [14, 15] investigated state similarity metrics for MDPs; they bounded the value difference of ground states and abstract states for several bisimulation metrics that induce an abstract MDP. This differs from our work, which develops a theory of abstraction that bounds the suboptimality of applying the optimal policy of an abstract MDP to its ground MDP, covering four types of state abstraction, one of which closely parallels bisimulation. Even-Dar and Mansour [13] analyzed different distance metrics used in identifying state space partitions subject to ε-similarity, also providing value bounds (their Lemma 4) for ε-homogeneity subject to the L∞ norm, which parallels our Claim 2. Ortner [27] developed an algorithm for learning partitions in an online setting by taking advantage of the confidence bounds for T and R provided by UCRL [3].

Hutter [18, 17] investigates state aggregation beyond the MDP setting, presenting a variety of results for aggregation functions in reinforcement learning. Most relevant to our investigation is Hutter's Theorem 8, which illustrates properties of aggregating states based on similar Q-values. Part (a) of Hutter's theorem parallels our Claim 1: both bound the value difference between ground and abstract states. Part (b) is analogous to our Lemma 1: both bound the value difference of applying the optimal abstract policy in the ground MDP. Part (c) is a repetition of the comment given by Li et al. [24] that Q* abstractions preserve the optimal value function. For Lemma 1, our proof strategies differ from Hutter's, but the result is the same.

Approximate state abstraction has also been applied to the planning problem, in which the agent is given a model of its environment and must compute a plan that satisfies some goal. Hostetler et al. [16] apply state abstraction to Monte Carlo Tree Search and expectimax search, giving value bounds of applying the optimal abstract action in the ground tree(s), similarly to our setting. Dearden and Boutilier [10] also formalize state abstraction for planning, focusing on abstractions that are quickly computed and offer bounded value.
Their primary analysis is on abstractions that remove negligible literals from the planning domain description, yielding value bounds for these abstractions and a means of incrementally improving abstract solutions to planning problems. Jiang et al. [20] analyze a similar setting, applying abstractions to the Upper Confidence Bound applied to Trees algorithm adapted for planning, introduced by Kocsis and Szepesvári [23].

Mandel et al. [26] advance Bayesian aggregation in RL to define Thompson Clustering for Reinforcement Learning (TCRL), an extension of which achieves near-optimal Bayesian regret bounds. Jiang [19] analyzes the problem of choosing between two candidate abstractions, developing an algorithm based on statistical tests that trades off the approximation error with the estimation error of the two abstractions, yielding a loss bound on the quality of the chosen policy.

3.2 Specific Abstraction Algorithms

Many previous works have targeted the creation of algorithms that enable state abstraction for MDPs. Andre and Russell [2] investigated a method for state abstraction in hierarchical reinforcement learning leveraging a programming language called ALISP that promotes the notion of safe state abstraction. Agents programmed using ALISP can ignore irrelevant parts of the state, achieving abstractions that maintain optimality. Dietterich [12] developed MAXQ, a framework for composing tasks into an abstracted hierarchy where state aggregation can be applied. Bakker and Schmidhuber [4] also target hierarchical abstraction, focusing on subgoal discovery. Jong and Stone [21] introduced a method called policy-irrelevance in which agents identify (online) which state variables may be safely abstracted away in a factored-state MDP. Dayan and Hinton [7] develop "Feudal Reinforcement Learning," an early form of hierarchical RL that restructures Q-Learning to manage the decomposition of a task into subtasks. For a more complete survey of algorithms that leverage state abstraction in past reinforcement-learning papers, see Li et al. [24], and for a survey of early works on hierarchical reinforcement learning, see Barto and Mahadevan [5].

3.3 Exact Abstraction Framework

Li et al. [24] developed a framework for exact state abstraction in MDPs. In particular, the authors defined five types of state aggregation functions, inspired by existing methods for state aggregation in MDPs. We generalize two of these five types, φ_{Q*} and φ_{model}, to the approximate abstraction case. Our generalizations are equivalent to theirs when exact criteria are used (i.e., ε = 0). Additionally, when exact criteria are used, our bounds indicate that no value is lost, which is one of the core results of Li et al. [24]. Walsh et al. [34] build on the framework they previously developed by showing empirically how to transfer abstractions between structurally related MDPs.

4 Abstraction Notation

We build upon the notation used by Li et al. [24], who introduced a unifying theoretical framework for state abstraction in MDPs.

Definition 1 (M_G, M_A): We understand an abstraction as a mapping from the state space of a ground MDP, M_G, to that of an abstract MDP, M_A, using a state aggregation scheme. Consequently, this mapping induces an abstract MDP. Let M_G = ⟨S_G, A, T_G, R_G, γ⟩ and M_A = ⟨S_A, A, T_A, R_A, γ⟩.

Definition 2 (S_A, φ): The states in the abstract MDP are constructed by applying a state aggregation function, φ, to the states in the ground MDP, S_G. More specifically, φ maps a state in the ground MDP to a state in the abstract MDP:

    S_A = {φ(s) | s ∈ S_G}.    (3)
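A minimal Python sketch of such a mapping follows; the dictionary-based representation and function names are our own illustrative choices, and the second helper anticipates the inverse-image operator G formalized in Definition 3 below.

```python
from typing import Dict, Set

# phi maps each ground state to an abstract state label (Equation 3).
phi: Dict[str, str] = {"s0": "c0", "s1": "c0", "s2": "c1"}

def abstract_states(phi: Dict[str, str]) -> Set[str]:
    """S_A = { phi(s) | s in S_G }."""
    return set(phi.values())

def ground_states_of(phi: Dict[str, str], abstract_state: str) -> Set[str]:
    """G(s_A): the ground states aggregated into a given abstract state."""
    return {g for g, c in phi.items() if c == abstract_state}

print(abstract_states(phi))          # {'c0', 'c1'}
print(ground_states_of(phi, "c0"))   # {'s0', 's1'}
```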
Definition 3 (G): Given a φ, each ground state has associated with it the ground states with which it is aggregated. Similarly, each abstract state has its constituent ground states. We let G be the function that retrieves these states:

    G(s) = {g ∈ S_G | φ(g) = φ(s)},  if s ∈ S_G,
           {g ∈ S_G | φ(g) = s},     if s ∈ S_A.    (4)

The abstract reward function and abstract transition dynamics for each abstract state are a weighted combination of the rewards and transitions for each ground state in the abstract state.

Definition 4 (ω(s)): We refer to the weight associated with a ground state, s ∈ S_G, by ω(s). The only restriction placed on the weighting scheme is that it induces a probability distribution on the ground states of each abstract state:

    ∀ s ∈ S_G:  Σ_{g∈G(s)} ω(g) = 1  AND  ω(s) ∈ [0,1].    (5)

Definition 5 (R_A): The abstract reward function R_A : S_A × A → [0,1] is a weighted sum of the rewards of each of the ground states that map to the same abstract state:

    R_A(s,a) = Σ_{g∈G(s)} R_G(g,a) ω(g).    (6)

Definition 6 (T_A): The abstract transition function T_A : S_A × A × S_A → [0,1] is a weighted sum of the transitions of each of the ground states that map to the same abstract state:

    T_A(s,a,s') = Σ_{g∈G(s)} Σ_{g'∈G(s')} T_G(g,a,g') ω(g).    (7)

5 Approximate State Abstraction

Here, we introduce our formal analysis of approximate state abstraction, including results bounding the error associated with these abstraction methods. In particular, we demonstrate that abstractions based on approximate Q* similarity (5.2), approximate model similarity (5.3), and approximate similarity between distributions over Q*, for both Boltzmann (5.4) and multinomial (5.5) distributions, induce abstract MDPs for which the optimal policy has bounded error in the ground MDP. We first introduce some additional notation.

Definition 7 (π*_A, π*_G): We let π*_A : S_A → A and π*_G : S_G → A stand for the optimal policies in the abstract and ground MDPs, respectively.

We are interested in how the optimal policy in the abstract MDP performs in the ground MDP. As such, we formally define the policy in the ground MDP derived from optimal behavior in the abstract MDP:

Definition 8 (π_GA): Given a state s ∈ S_G and a state aggregation function, φ,

    π_GA(s) = π*_A(φ(s)).    (8)

We now define types of abstraction based on functions of state-action pairs.

Definition 9 (φ̃_{f,ε}): Given a function f : S_G × A → ℝ and a fixed non-negative ε ∈ ℝ, we define φ̃_{f,ε} as a type of approximate state aggregation function that satisfies the following for any two ground states s_1, s_2:

    φ̃_{f,ε}(s_1) = φ̃_{f,ε}(s_2) → ∀a: |f(s_1,a) − f(s_2,a)| ≤ ε.    (9)

That is, when φ̃_{f,ε} aggregates states, all aggregated states have values of f within ε of each other for all actions.

Finally, we establish notation to distinguish between the ground and abstract value (V) and action value (Q) functions.

Definition 10 (Q_G, V_G): Let Q_G = Q^{π*_G} : S_G × A → ℝ and V_G = V^{π*_G} : S_G → ℝ denote the optimal Q and optimal value functions in the ground MDP.

Definition 11 (Q_A, V_A): Let Q_A = Q^{π*_A} : S_A × A → ℝ and V_A = V^{π*_A} : S_A → ℝ stand for the optimal Q and optimal value functions in the abstract MDP.
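The following sketch shows one simple (greedy) way to realize a φ̃_{f,ε}-style aggregation and the induced abstract reward function of Definition 5; the clustering strategy and data layout are our own illustrative assumptions, not an algorithm from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def greedy_phi(states: List[str], actions: List[str],
               f: Callable[[str, str], float], eps: float) -> Dict[str, int]:
    """Greedily build a phi_{f,eps}-style aggregation: two states share an
    abstract state only if |f(s1, a) - f(s2, a)| <= eps for every action a
    (the predicate of Equation 9)."""
    clusters: List[List[str]] = []
    phi: Dict[str, int] = {}
    for s in states:
        for idx, members in enumerate(clusters):
            if all(abs(f(s, a) - f(m, a)) <= eps for m in members for a in actions):
                members.append(s)
                phi[s] = idx
                break
        else:  # no compatible cluster found; start a new abstract state
            clusters.append([s])
            phi[s] = len(clusters) - 1
    return phi

def abstract_reward(phi: Dict[str, int], R_G: Dict[Tuple[str, str], float],
                    actions: List[str],
                    omega: Dict[str, float]) -> Dict[Tuple[int, str], float]:
    """R_A(s_A, a) = sum over g in G(s_A) of R_G(g, a) * omega(g) (Equation 6);
    omega must sum to 1 within each abstract state (Definition 4)."""
    R_A: Dict[Tuple[int, str], float] = defaultdict(float)
    for g, s_A in phi.items():
        for a in actions:
            R_A[(s_A, a)] += R_G[(g, a)] * omega[g]
    return dict(R_A)
```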
5.1 Main Result

We now introduce the main result of the paper.

Theorem 1. There exist at least four types of approximate state aggregation functions, φ̃_{Q*,ε}, φ̃_{model,ε}, φ̃_{bolt,ε} and φ̃_{mult,ε}, for which the optimal policy in the abstract MDP, applied to the ground MDP, has suboptimality bounded polynomially in ε:

    ∀ s ∈ S_G:  V_G^{π*_G}(s) − V_G^{π_GA}(s) ≤ 2ε η_f,    (10)

where η_f differs between abstraction function families:

    η_{Q*} = 1 / (1−γ)²,

    η_model = (1 + γ(|S_G|−1)) / (1−γ)³,

    η_bolt = ( |A|/(1−γ) + ε k_bolt + k_bolt ) / (1−γ)²,

    η_mult = ( |A|/(1−γ) + k_mult ) / (1−γ)².

For η_bolt and η_mult, we also assume that the difference in the normalizing terms of each distribution is bounded by some non-negative constant, k_mult, k_bolt ∈ ℝ, times ε:

    | Σ_i Q_G(s_1,a_i) − Σ_j Q_G(s_2,a_j) | ≤ k_mult × ε,

    | Σ_i e^{Q_G(s_1,a_i)} − Σ_j e^{Q_G(s_2,a_j)} | ≤ k_bolt × ε.

Naturally, the value bound of Equation 10 is meaningless for 2ε η_f ≥ RMax/(1−γ) = 1/(1−γ), since this is the maximum possible value in any MDP (and we assumed the range of R is [0,1]). In light of this, observe that for ε = 0, all of the above bounds are exactly 0. Any value of ε interpolated between these two points achieves different degrees of abstraction, with different degrees of bounded loss.

We now introduce each approximate aggregation family and prove the theorem by proving the specific value bound for each function type.

5.2 Optimal Q Function: φ̃_{Q*,ε}

We consider an approximate version of Li et al. [24]'s φ_{Q*}. In our abstraction, states are aggregated together when their optimal Q-values are within ε.

Definition 12 (φ̃_{Q*,ε}): An approximate Q function abstraction has the same form as Equation 9:

    φ̃_{Q*,ε}(s_1) = φ̃_{Q*,ε}(s_2) → ∀a: |Q_G(s_1,a) − Q_G(s_2,a)| ≤ ε.    (11)

Lemma 1. When a φ̃_{Q*,ε} type abstraction is used to create the abstract MDP:

    ∀ s ∈ S_G:  V_G^{π*_G}(s) − V_G^{π_GA}(s) ≤ 2ε / (1−γ)².    (12)

Proof of Lemma 1: We first demonstrate that Q-values in the abstract MDP are close to Q-values in the ground MDP (Claim 1). We next leverage Claim 1 to demonstrate that the optimal action in the abstract MDP is nearly optimal in the ground MDP (Claim 2). Lastly, we use Claim 2 to conclude Lemma 1 (Claim 3).

Claim 1. Optimal Q-values in the abstract MDP closely resemble optimal Q-values in the ground MDP:

    ∀ s_G ∈ S_G, a:  |Q_G(s_G,a) − Q_A(φ̃_{Q*,ε}(s_G),a)| ≤ ε / (1−γ).    (13)

Consider a non-Markovian decision process of the same form as an MDP, M_T = ⟨S_T, A, R_T, T_T, γ⟩, parameterized by an integer T, such that for the first T time steps the reward function, transition dynamics and state space are those of the abstract MDP, M_A, and after T time steps the reward function, transition dynamics and state spaces are those of M_G. Thus,

    S_T = S_G    if T = 0,
          S_A    otherwise,

    R_T(s,a) = R_G(s,a)    if T = 0,
               R_A(s,a)    otherwise,

    T_T(s,a,s') = T_G(s,a,s')                    if T = 0,
                  Σ_{g∈G(s)} T_G(g,a,s') ω(g)    if T = 1,
                  T_A(s,a,s')                    otherwise.

The Q-value of state s in S_T for action a is:

    Q_T(s,a) = Q_G(s,a)                    if T = 0,
               Σ_{g∈G(s)} Q_G(g,a) ω(g)    if T = 1,    (14)
               R_A(s,a) + σ_{T−1}(s,a)     otherwise,

where:

    σ_{T−1}(s,a) = γ Σ_{s_A'∈S_A} T_A(s,a,s_A') max_{a'} Q_{T−1}(s_A',a').

We proceed by induction on T to show that:

    ∀ T, s_G ∈ S_G, a:  |Q_T(s_T,a) − Q_G(s_G,a)| ≤ Σ_{t=0}^{T−1} ε γ^t,    (15)

where s_T = s_G if T = 0 and s_T = φ̃_{Q*,ε}(s_G) otherwise.

Base Case: T = 0

When T = 0, Q_T = Q_G, so this base case trivially follows.
Base Case: T = 1

By definition of Q_T, we have that Q_1 is

    Q_1(s,a) = Σ_{g∈G(s)} Q_G(g,a) ω(g).

Since all co-aggregated states have Q-values within ε of one another and ω(g) induces a convex combination,

    Q_1(s_T,a) ≤ ε + Q_G(s_G,a)

    ∴ |Q_1(s_T,a) − Q_G(s_G,a)| ≤ Σ_{t=0}^{0} ε γ^t.

Inductive Case: T > 1

We assume as our inductive hypothesis that:

    ∀ s_G ∈ S_G, a:  |Q_{T−1}(s_T,a) − Q_G(s_G,a)| ≤ Σ_{t=0}^{T−2} ε γ^t.

Consider a fixed but arbitrary state, s_G ∈ S_G, and a fixed but arbitrary action a. Since T > 1, s_T is φ̃_{Q*,ε}(s_G). By definition of Q_T(s_T,a), R_A, and T_A:

    Q_T(s_T,a) = Σ_{g∈G(s_T)} ω(g) × [ R_G(g,a) + γ Σ_{g'∈S_G} T_G(g,a,g') max_{a'} Q_{T−1}(g',a') ].

Applying our inductive hypothesis yields:

    Q_T(s_T,a) ≤ Σ_{g∈G(s_T)} ω(g) × [ R_G(g,a) + γ Σ_{g'∈S_G} T_G(g,a,g') max_{a'} ( Q_G(g',a') + Σ_{t=0}^{T−2} ε γ^t ) ].

Since all aggregated states have Q-values within ε of one another:

    Q_T(s_T,a) ≤ γ Σ_{t=0}^{T−2} ε γ^t + ε + Q_G(s_G,a).

Since s_G is arbitrary, we conclude Equation 15. As T → ∞, Σ_{t=0}^{T−1} ε γ^t → ε/(1−γ) by the sum of an infinite geometric series and Q_T → Q_A. Thus, Equation 15 yields Claim 1.

Claim 2. Consider a fixed but arbitrary state s_G ∈ S_G and its corresponding abstract state s_A = φ̃_{Q*,ε}(s_G). Let a*_G stand for the optimal action in s_G, and a*_A stand for the optimal action in s_A:

    a*_G = argmax_a Q_G(s_G,a),    a*_A = argmax_a Q_A(s_A,a).

The optimal action in the abstract MDP has a Q-value in the ground MDP that is nearly optimal:

    V_G(s_G) ≤ Q_G(s_G, a*_A) + 2ε / (1−γ).    (16)

By Claim 1,

    V_G(s_G) = Q_G(s_G, a*_G) ≤ Q_A(s_A, a*_G) + ε / (1−γ).    (17)

By the definition of a*_A, we know that

    Q_A(s_A, a*_G) + ε / (1−γ) ≤ Q_A(s_A, a*_A) + ε / (1−γ).    (18)

Lastly, again by Claim 1, we know

    Q_A(s_A, a*_A) + ε / (1−γ) ≤ Q_G(s_G, a*_A) + 2ε / (1−γ).    (19)

Therefore, Equation 16 follows.

Claim 3. Lemma 1 follows from Claim 2.

Consider the policy for M_G of following the optimal abstract policy π*_A for t steps and then following the optimal ground policy π*_G in M_G:

    π_{A,t}(s) = π*_G(s)    if t = 0,
                 π_GA(s)    if t > 0.    (20)

For t > 0, the value of this policy for s_G ∈ S_G in the ground MDP is:

    V_G^{π_{A,t}}(s_G) = R_G(s_G, π_{A,t}(s_G)) + γ Σ_{s_G'∈S_G} T_G(s_G, π_{A,t}(s_G), s_G') V_G^{π_{A,t−1}}(s_G').

For t = 0, V_G^{π_{A,t}}(s_G) is simply V_G(s_G).

We now show by induction on t that

    ∀ t, s_G ∈ S_G:  V_G(s_G) ≤ V_G^{π_{A,t}}(s_G) + Σ_{i=0}^{t} γ^i · 2ε/(1−γ).    (21)

Base Case: t = 0

By definition, when t = 0, V_G^{π_{A,t}} = V_G, so our bound trivially holds in this case.

Inductive Case: t > 0

Consider a fixed but arbitrary state s_G ∈ S_G. We assume for our inductive hypothesis that

    V_G(s_G) ≤ V_G^{π_{A,t−1}}(s_G) + Σ_{i=0}^{t−1} γ^i · 2ε/(1−γ).    (22)

By definition,

    V_G^{π_{A,t}}(s_G) = R_G(s_G, π_{A,t}(s_G)) + γ Σ_{s_G'} T_G(s_G, π_{A,t}(s_G), s_G') V_G^{π_{A,t−1}}(s_G').

Applying our inductive hypothesis yields:

    V_G^{π_{A,t}}(s_G) ≥ R_G(s_G, π_{A,t}(s_G)) + γ Σ_{s_G'} T_G(s_G, π_{A,t}(s_G), s_G') ( V_G(s_G') − Σ_{i=0}^{t−1} γ^i · 2ε/(1−γ) ).

Therefore,

    V_G^{π_{A,t}}(s_G) ≥ −γ Σ_{i=0}^{t−1} γ^i · 2ε/(1−γ) + Q_G(s_G, π_{A,t}(s_G)).

Applying Claim 2 yields:

    V_G^{π_{A,t}}(s_G) ≥ −γ Σ_{i=0}^{t−1} γ^i · 2ε/(1−γ) − 2ε/(1−γ) + V_G(s_G)

    ∴ V_G(s_G) ≤ V_G^{π_{A,t}}(s_G) + Σ_{i=0}^{t} γ^i · 2ε/(1−γ).

Since s_G was arbitrary, we conclude that our bound holds for all states in S_G for the inductive case. Thus, from our base case and induction, we conclude that

    ∀ t, s_G ∈ S_G:  V_G^{π*_G}(s_G) ≤ V_G^{π_{A,t}}(s_G) + Σ_{i=0}^{t} γ^i · 2ε/(1−γ).    (23)

Note that as t → ∞, Σ_{i=0}^{t} γ^i · 2ε/(1−γ) → 2ε/(1−γ)² by the sum of an infinite geometric series and π_{A,t}(s) → π_GA(s). Thus, we conclude Lemma 1.
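As an informal sanity check of Lemma 1 (our own sketch, not part of the paper), the code below builds a small random ground MDP, forms a greedy φ̃_{Q*,ε} aggregation from Q_G, induces the abstract MDP with uniform weights ω (Definitions 5 and 6), and compares the loss of π_GA in the ground MDP against the 2ε/(1−γ)² bound. All names and the specific construction are assumptions made for illustration.

```python
import numpy as np

def q_value_iteration(R, T, gamma, iters=2000):
    """Optimal Q via value iteration; R is |S|x|A|, T is |S|x|A|x|S|."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

def evaluate_policy(pi, R, T, gamma, iters=2000):
    """Value of a deterministic policy pi (array of action indices)."""
    idx = np.arange(R.shape[0])
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = R[idx, pi] + gamma * T[idx, pi] @ V
    return V

rng = np.random.default_rng(0)
nS, nA, gamma, eps = 30, 3, 0.9, 0.1
R_G = rng.uniform(0.0, 1.0, (nS, nA))
T_G = rng.dirichlet(np.ones(nS), size=(nS, nA))     # random transition kernels
Q_G = q_value_iteration(R_G, T_G, gamma)

# Greedy phi_{Q*,eps}: a state joins a cluster only if its optimal Q-values are
# within eps of every current member's, for every action (Equation 11).
clusters, phi = [], np.zeros(nS, dtype=int)
for s in range(nS):
    for c, members in enumerate(clusters):
        if all(np.max(np.abs(Q_G[s] - Q_G[m])) <= eps for m in members):
            members.append(s)
            phi[s] = c
            break
    else:
        clusters.append([s])
        phi[s] = len(clusters) - 1

# Abstract MDP with uniform omega over each cluster (Definitions 5 and 6).
nSA = len(clusters)
R_A, T_A = np.zeros((nSA, nA)), np.zeros((nSA, nA, nSA))
for c, members in enumerate(clusters):
    w = 1.0 / len(members)
    for g in members:
        R_A[c] += w * R_G[g]
        for c2, members2 in enumerate(clusters):
            T_A[c, :, c2] += w * T_G[g][:, members2].sum(axis=1)

Q_A = q_value_iteration(R_A, T_A, gamma)
pi_GA = Q_A[phi].argmax(axis=1)       # pi_GA(s) = pi*_A(phi(s)) (Definition 8)

loss = np.max(Q_G.max(axis=1) - evaluate_policy(pi_GA, R_G, T_G, gamma))
print(f"|S_A| = {nSA}, worst-case loss = {loss:.4f}, "
      f"Lemma 1 bound = {2 * eps / (1 - gamma) ** 2:.4f}")
```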
5.3 Model Similarity: φ̃_{model,ε}

Now, consider an approximate version of Li et al. [24]'s φ_{model}, where states are aggregated together when their rewards and transitions are within ε.

Definition 13 (φ̃_{model,ε}): We let φ̃_{model,ε} define a type of abstraction that, for fixed ε, satisfies:

    φ̃_{model,ε}(s_1) = φ̃_{model,ε}(s_2) →
        ∀a: |R_G(s_1,a) − R_G(s_2,a)| ≤ ε  AND  ∀ s_A ∈ S_A: | Σ_{s_G'∈G(s_A)} [T_G(s_1,a,s_G') − T_G(s_2,a,s_G')] | ≤ ε.    (24)

Lemma 2. When S_A is created using a φ̃_{model,ε} type:

    ∀ s ∈ S_G:  V_G^{π*_G}(s) − V_G^{π_GA}(s) ≤ (2ε + 2γε(|S_G|−1)) / (1−γ)³.    (25)

Proof of Lemma 2: Let B be the maximum Q-value difference between any pair of ground states in the same abstract state for φ̃_{model,ε}:

    B = max_{s_1,s_2,a} |Q_G(s_1,a) − Q_G(s_2,a)|,

where s_1, s_2 ∈ G(s_A). First, we expand:

    B = max_{s_1,s_2,a} | R_G(s_1,a) − R_G(s_2,a) + γ Σ_{s_G'∈S_G} (T_G(s_1,a,s_G') − T_G(s_2,a,s_G')) max_{a'} Q_G(s_G',a') |.    (26)

Since the difference of rewards is bounded by ε:

    B ≤ ε + γ Σ_{s_A∈S_A} [ Σ_{s_G'∈G(s_A)} (T_G(s_1,a,s_G') − T_G(s_2,a,s_G')) max_{a'} Q_G(s_G',a') ].    (27)

By similarity of transitions under φ̃_{model,ε}:

    B ≤ ε + γ QMax Σ_{s_A∈S_A} ε ≤ ε + γ |S_G| ε QMax.

Recall that QMax = RMax/(1−γ), and we defined RMax = 1:

    B ≤ (ε + γ(|S_G|−1)ε) / (1−γ).

Since the Q-values of ground states grouped under φ̃_{model,ε} are within B of one another, we can understand φ̃_{model,ε} as a type of φ̃_{Q*,B}. Applying Lemma 1 yields Lemma 2.

5.4 Boltzmann over Optimal Q: φ̃_{bolt,ε}

Here, we introduce φ̃_{bolt,ε}, which aggregates states with similar Boltzmann distributions on Q-values. This type of abstraction is appealing, as Boltzmann distributions balance exploration and exploitation [32]. We find this type particularly interesting for abstraction purposes as, unlike φ̃_{Q*,ε}, it allows for aggregation when Q-value ratios are similar but their magnitudes are different.
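As a small illustration (our own, with made-up Q-values), the snippet below computes the Boltzmann distribution e^{Q(s,a)} / Σ_b e^{Q(s,b)} implicit in the normalizing-term assumption of Theorem 1. Note, for example, that shifting all of a state's Q-values by a constant changes their magnitudes but leaves the distribution unchanged, so such states can still be candidates for aggregation under φ̃_{bolt,ε}.

```python
import numpy as np

def boltzmann(q_values: np.ndarray) -> np.ndarray:
    """Boltzmann (softmax) distribution over a state's Q-values:
    bolt(s, a) = exp(Q(s, a)) / sum_b exp(Q(s, b))."""
    exp_q = np.exp(q_values - q_values.max())   # subtract max for numerical stability
    return exp_q / exp_q.sum()

# Hypothetical Q-values for two ground states with different magnitudes.
q_s1 = np.array([0.2, 0.5, 0.9])
q_s2 = q_s1 + 3.0          # same shape, shifted by a constant

print(boltzmann(q_s1))     # identical distributions ...
print(boltzmann(q_s2))     # ... so |bolt(s1, a) - bolt(s2, a)| = 0 for every a
```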