The Verification of Probabilistic Systems Under Memoryless Partial-Information Policies is Hard∗

Luca de Alfaro
Department of Electrical Engineering and Computer Sciences, University of California at Berkeley. Email: [email protected]

Abstract

Several models of probabilistic systems comprise both probabilistic and nondeterministic choice. In such models, the resolution of nondeterministic choices is mediated by the concept of policies (sometimes called adversaries). A policy is a criterion for choosing among nondeterministic alternatives on the basis of the past sequence of states of the system. By fixing the resolution of nondeterministic choice, a policy reduces the system to an ordinary stochastic system, thus making it possible to reason about the probability of events of interest. A partial-information policy is a policy that can observe only a portion of the system state, and that must base its choices on finite sequences of such partial observations. We argue that in order to obtain accurate estimates of the worst-case performance of a probabilistic system, it would often be desirable to consider partial-information policies. However, we show that even when considering memoryless partial-information policies, the problem of deciding whether the system can stay forever with positive probability in a given subset of states becomes NP-complete. As a consequence, many verification problems that can be solved in polynomial time under perfect-information policies, such as the model checking of pCTL or the computation of the worst-case long-run average outcome of tasks, become NP-hard under memoryless partial-information policies. On the positive side, we show that the worst-case long-run average outcome of tasks under memoryless partial-information policies can be computed by solving a nonlinear programming problem, opening the way to the use of numerical approximation algorithms.

∗ This paper appeared in the Proceedings of the Workshop on Probabilistic Methods in Verification, published as Technical Report CSR-99-9, pages 19–32, University of Birmingham, 1999. This research was supported in part by the NSF CAREER award CCR-9501708, by the DARPA (NASA Ames) grant NAG2-1214, by the DARPA (Wright-Patterson AFB) grant F33615-98-C-3614, by the ARO MURI grant DAAH-04-96-1-0341, and by the Gigascale Silicon Research Center.

1 Introduction

In several models of probabilistic systems, probabilistic and nondeterministic choice coexist. While probabilistic choice provides a statistical characterization of the system behavior, nondeterminism is used to model concurrency [Var85, PZ86, SL94], and lack of knowledge of transition probabilities [Seg95, dA97] and transition rates [dA98b]. In such a model, the probability of events depends on the way the nondeterministic choices are resolved during the behavior of the system.
To assign a probability to the events, it is customary to use the notion of policy [Bel57], closely related to the schedulers of [Var85] and the adversaries of [SL94]. Whenever the choice among nondeterministic alternatives arises, a policy dictates the probability of choosing each alternative, possibly as a function of the past sequence of states visited by the system. Hence, once the policy is specified, the nondeterminism present in the system is resolved, and the system is thus reduced to a purely probabilistic system. In the statement of verification problems, nondeterminism is usually assigned a demonic role: a property is considered to hold iff it holds under any possible resolution of nondeterministic choice, or equivalently, under any policy. Perfect-information and partial-information policies correspond to demons with different observation powers. A perfect-information policy is a policy that can observe the complete description of the system state, and that can select among the nondeterministic alternatives on the basis of the finite sequence of states traversed by the system. In contrast, a partial-information policy can only observe part of the system state, and it must select among the alternatives on the basis of finite sequences of such incomplete observations.

To understand why partial-information policies can lead to more accurate estimates of the worst-case system performance, consider the following example. Consider a telecommunication network that routes phone calls between nodes, and consider two users u_1 and u_2, attached to two nodes of the network. When user u_1 tries to call user u_2, he is either connected to user u_2, or he receives a busy signal, indicating that there are no connections available to route the call. We intend to model the system in order to study the long-run average fraction of successful calls. To simplify the example, we assume that we have enough statistical information about the network to model the number of connections available between any pair of nodes as a purely probabilistic process, without any nondeterminism.
On the other hand, we do not have precise information on when our particular user u_1 wishes to call u_2, so that we model the choice to place a call as a nondeterministic choice. From the point of view of u_1, a state s of the system consists of four components s = (s[c], s[u_1], s[n], s[t]), where:

• s[c] is a portion of the state visible to everyone (e.g., the current time of the day);
• s[u_1] ∈ {idle, trying, connected} describes the state of u_1;
• s[n] ∈ {0,...,N} is the number of available connections for calls between u_1 and u_2; if s[n] = 0, then no call from u_1 to u_2 can take place;
• s[t] is a portion of the state visible only to the network (e.g., the state of other communication links and routing tables).

If s[u_1] = idle, there is a nondeterministic choice between staying at idle (i.e., not placing a call), or going to trying (i.e., dialing the number and waiting for the connection to be established). If s[u_1] = trying and s[n] > 0, then the connection is established, and u_1 proceeds to connected; if s[u_1] = trying and s[n] = 0, user u_1 gets a busy signal, and returns to idle. If s[u_1] = connected, user u_1 can either remain in this state, or hang up and proceed to idle. In Section 3 we present the formal model of a system similar to the one described above.

If we model the communication system as indicated above, and study the system under perfect-information policies, we obtain that in the worst case the long-run fraction of successful calls is 0 — independently of how many free connections there are on average between u_1 and u_2! In fact, at states where s[u_1] = idle the choice of whether to stay at idle or to try to place a call (going to trying) is nondeterministic. In such a state, a perfect-information policy can look at the value of s[n] before deciding whether to place a call. Hence, a worst-case perfect-information policy will place a call only from states where s[n] = 0, i.e., where all the connections between u_1 and u_2 are busy. While the value 0 is indeed a lower bound for the long-run average fraction of successful calls from u_1 to u_2, this answer takes an unrealistic, and overly pessimistic, view of the system. In fact, the use of nondeterminism to model the decision of u_1 to place a call is intended to capture an unknown dependency between the frequency with which u_1 places calls, and global information such as the time of the day. It is unrealistic to assume that u_1 can base its decision to place a call on the number of free connections, since such information would not be available to u_1 in a real telecommunication system. In order to obtain a more realistic worst-case analysis, we need to consider partial-information policies, which can base their decision of whether to place a call on the state of u_1 and on the global information in s[c], but not on information that is internal to the network, such as the number of free connections between u_1 and u_2.

The telecommunication example also suggests why the need for partial-information policies is felt more in the analysis of probabilistic systems than in the analysis of purely nondeterministic ones. In a purely nondeterministic system, we are generally interested in the possibility of events, rather than in their frequency. Hence, all finite sequences of events, however rare they might be, are taken into account for establishing a property.
In purely nondeterministic systems, the concept of fairness is normally used instead of partial information to guard against an infinite number of unfortunate coincidences, such as trying to place a call only when no connection is free. In the above example, we can rule out such behaviors by adding a fairness condition to the system, requiring that the choice between placing a call and staying in idle should be (strongly) fair at all states. Even though fairness and partial information are not equivalent, fairness is preferred because it is amenable to simpler verification methods. However, fairness is not a substitute for partial information in the study of the frequency of events. For example, the above fairness condition on placing calls does not ensure that the frequency of placing calls is not influenced by the number of free connections. In fact, even with this fairness condition, the worst-case long-run average fraction of calls that are successful is arbitrarily close to 0: the fairness condition is still satisfied if a fraction of the calls smaller than 1, but arbitrarily close to 1, is placed from states where no connection is free. We note that fairness in probabilistic systems can be used as a surrogate for partial information when the specification languages cannot refer to the frequency of events, as is the case for the logic pCTL [BK98, dA99].

Our model for systems with both probabilistic and nondeterministic choice is that of Markov decision processes [Bel57, Der70], which are closely related to several models proposed in the literature [Var85, PZ86, SL94]. We consider the confinement problem, consisting in deciding whether there is a policy in a given class of policies that enables us to stay forever in a specified subset of states with probability greater than 0. While the confinement problem is solvable in polynomial time for general policies, from [Rei84] we have that the confinement problem is EXPTIME-complete for partial-information policies. We show that the confinement problem is NP-complete even if we restrict our attention to memoryless and limit-memoryless partial-information policies. Limit-memoryless partial-information policies are policies whose state-action frequencies converge to those of a memoryless partial-information policy. These results imply that the model-checking problem for pCTL specifications, and the problem of computing the worst-case long-run average outcome of tasks [dA98a], which can be solved in polynomial time under perfect-information policies, are EXPTIME-hard under partial-information policies, and NP-hard under memoryless and limit-memoryless partial-information policies. On the positive side, we show that the worst-case long-run average outcome of tasks under memoryless and limit-memoryless partial-information policies can be computed by solving a nonlinear optimization problem.

The paper is organized as follows. In Section 2 we describe Markov decision processes and partial-information policies, and we define the confinement problem. In Section 3 we present the machinery for defining the long-run average outcome of tasks, and we describe a simple telecommunication example that helps to motivate the consideration of partial-information policies. In Section 4 we present lower-bound results on the complexity of pCTL model checking and of the computation of long-run average outcomes under partial information.
Section 5 presents the optimization problem that enables the computation of the worst-case long-run average outcome of tasks under memoryless partial-information policies, and Section 6 contains some concluding comments.

2 Markov Decision Processes and Partial-Information Policies

Our model for probabilistic systems is a Markov decision process (MDP). An MDP is a generalization of a Markov chain in which a set of possible actions is associated with each state. To each state-action pair corresponds a probability distribution on the states, which is used to select the successor state [Der70]. Markov decision processes are closely related to the probabilistic automata of [Rab63], the concurrent Markov chains of [Var85], and the simple probabilistic automata of [SL94, Seg95]. Given a countable set C we denote by D(C) the set of probability distributions over C, i.e., the set of functions f : C → [0,1] such that Σ_{x∈C} f(x) = 1. An MDP P = (S, Acts, A, p) consists of the following components:

• A set S of states.
• A set Acts of actions.
• A function A : S → 2^Acts, which associates with each s ∈ S a finite set A(s) ⊆ Acts of actions available at s.
• A function p : S × Acts → D(S), which associates with each s, t ∈ S and a ∈ A(s) the probability p(s,a)(t) of a transition from s to t when action a is selected.

We measure the complexity of the algorithms as a function of the size of the MDP P, defined as Σ_{s∈S} |A(s)|. A path of an MDP is an infinite sequence s_0, a_0, s_1, a_1, ... of alternating states and actions, such that s_i ∈ S, a_i ∈ A(s_i), and p(s_i, a_i)(s_{i+1}) > 0 for all i ≥ 0. For i ≥ 0, the sequence is constructed by iterating a two-phase selection process. First, an action a_i ∈ A(s_i) is selected nondeterministically; second, the successor state s_{i+1} is chosen according to the probability distribution p(s_i, a_i). Given a path s_0, a_0, s_1, a_1, ... and k ≥ 0, we denote by X_k, Y_k its generic k-th state s_k and its generic k-th action a_k, respectively. For n ≥ 0, we call a finite portion s_0, a_0, s_1, ..., s_n of a path a finite path prefix.

Let S^+ be the set of non-empty finite sequences of states. A (perfect-information) policy π is a mapping π : S^+ → D(Acts), which associates with each sequence of states s̄ : s_0, s_1, ..., s_n ∈ S^+ and each a ∈ A(s_n) the probability π(s̄)(a) of choosing a after following the sequence of states s̄. We require that π(s̄)(a) > 0 implies a ∈ A(s_n): a policy can choose only among the actions that are available at the state where the choice is made. According to this definition, policies are randomized, differently from the schedulers of [Var85, PZ86], which are deterministic. We indicate with Π the set of all policies. We say that a policy π is memoryless if π(s̄s) = π(t̄s) for all s̄, t̄ ∈ S* and all s ∈ S; that is, the choice of a memoryless policy depends only on the last state of the sequence.

To define partial-information policies, we define partial-information relations. A partial-information relation for an MDP P = (S, Acts, A, p) is an equivalence relation ∼ ⊆ S × S such that for all s ∼ t, we have A(s) = A(t). If two states are related by ∼, then the states cannot be distinguished by a partial-information policy. The condition on ∼ ensures that if two states cannot be distinguished by the policy, then the policy can choose among the same actions at the two states. Given two sequences of states s̄ : s_0, ..., s_n and t̄ : t_0, ..., t_m, with m, n ≥ 0, we write s̄ ∼ t̄ iff m = n and s_i ∼ t_i for all 0 ≤ i ≤ n.
Given an MDP P and a partial-information relation ∼ for P, we say that a policy π is partial-information iff π(s̄) = π(t̄) for all s̄, t̄ ∈ S^+ such that s̄ ∼ t̄. If the relation ∼ has been fixed, we denote by PIPol the set of partial-information policies with respect to ∼.

Once a policy π has been selected, the Markov decision process is reduced to a purely probabilistic process, and it becomes possible to define the probabilities of events. In particular, the probability of following a finite path prefix s_0, a_0, s_1, a_1, ..., s_n under policy π ∈ Π is given by

    Pr_{s_0}^π(X_0 = s_0 ∧ Y_0 = a_0 ∧ ··· ∧ X_n = s_n) = ∏_{i=0}^{n−1} p(s_i, a_i)(s_{i+1}) π(s_0, ..., s_i)(a_i).

To extend this probability measure to subsets of infinite paths, for every state s ∈ S we denote by Θ_s the set of (infinite) paths having s as initial state. Given two paths (or path prefixes) θ_1 and θ_2, we denote by θ_1 ⪯ θ_2 the fact that θ_1 is a prefix of θ_2. Following the classical definition of [KSK66], we let B_s ⊆ 2^{Θ_s} be the σ-algebra of measurable subsets of Θ_s, defined as the smallest algebra that contains all the cylinder sets {θ ∈ Θ_s | σ ⪯ θ}, for σ ranging over all finite path prefixes, and that is closed under complementation and countable unions (and hence also countable intersections). The elements of B_s are called events, and they are the measurable sets of paths to which we will associate a probability. For A ∈ ∪_{s∈S} B_s, we write Pr_s^π(A) to denote the probability of the event A ∩ Θ_s starting from the initial state s ∈ S under policy π, and we write E_s^π{f} to denote the expectation of the random function f : Θ_s → ℝ from initial state s under policy π.

The Confinement Problem

Given an MDP P = (S, Acts, A, p), a subset U ⊆ S, a state s ∈ S, and a class of policies C, the confinement problem consists in determining whether there is a policy π ∈ C such that

    Pr_s^π(∀k ≥ 0. X_k ∈ U) > 0.    (1)

It is known that for C = Π the confinement problem can be solved in polynomial time with efficient graph algorithms. We shall study the complexity of this problem for partial-information policies. We note that the confinement problem is at the heart of several algorithms for the model checking of pCTL* specifications [CY95, BdA95, BK98]; hence, the complexity of the confinement problem directly affects the complexity of these model-checking problems.
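To illustrate the polynomial-time case C = Π, the following sketch (ours, not from the paper) decides (1) by a greatest-fixpoint computation on the underlying graph: it first computes the largest core ⊆ U in which every state has some action whose successors all remain in the core, and then checks whether s can reach that core through states of U. The dictionary encoding of the MDP, with trans[t][a] mapping each successor to its (positive) transition probability, is an assumption made for illustration.

def confinement_possible(trans, U, s):
    # Greatest fixpoint: the largest core of U in which every state has an
    # action whose successors all lie inside the core.
    core = set(U)
    changed = True
    while changed:
        changed = False
        for t in list(core):
            if not any(all(v in core for v in trans[t][a]) for a in trans[t]):
                core.discard(t)
                changed = True
    # From a core state, a policy can stay in U forever with probability 1,
    # so it suffices that s can reach the core along states of U: every
    # transition on such a path has positive probability.
    if s not in U:
        return False
    frontier, seen = [s], {s}
    while frontier:
        t = frontier.pop()
        if t in core:
            return True
        for a in trans[t]:
            for v in trans[t][a]:
                if v in U and v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return False

The reachability step is clearly sufficient; for the converse, the standard argument is that, with probability 1, a path confined to U forever eventually remains inside an end component contained in U, and every such end component lies in the core.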
3 Long-Run Average Outcome

The long-run average properties considered in [dA98a, dA98b] refer to the average outcome of a task, which is repeated infinitely often during the behavior of the system. During a task, a certain amount of outcome is accrued, indicating for instance the time required to complete the task, or the successful or unsuccessful completion of the task. In our telecommunication example, a task consists in trying to place a call; the outcome accrued is 1 if the call succeeds, and 0 if no connection is available. The long-run average outcome of this task is equal to the long-run average fraction of successful calls.

Given an MDP P = (S, Acts, A, p), we specify tasks and outcomes using two labelings r and w. The labeling w : S × Acts → {0,1} associates with each s ∈ S and each a ∈ A(s) the value 1 if taking action a at s signals the completion of a task, and the value 0 otherwise. The labeling r : S × Acts → ℝ associates an outcome with each state-action pair. We say that a policy π is proper from s ∈ S if

    lim_{n→∞} E_s^π{ Σ_{k=0}^{n−1} w(X_k, Y_k) } = ∞,

indicating that the system performs an infinite expected number of experiments from s. We denote by PropPol(s) ⊆ Π the set of proper policies from s. Given s ∈ S and a proper policy π ∈ PropPol(s), we define the long-run average outcome v_s^π of π from s by

    v_s^π = liminf_{n→∞} E_s^π{ Σ_{k=0}^{n−1} r(X_k, Y_k) } / E_s^π{ Σ_{k=0}^{n−1} w(X_k, Y_k) }.

Given a class of policies C, let PropS(C) = {s ∈ S | PropPol(s) ∩ C ≠ ∅} be the set of states with at least one proper policy belonging to the class. The minimum long-run average outcome problem consists in computing

    v_C^− = min_{s ∈ PropS(C)} inf_{π ∈ C ∩ PropPol(s)} v_s^π,

assuming that PropS(C) ≠ ∅. If C = Π, this problem can be solved in polynomial time by a reduction to linear programming [dA97, dA98a].

A Simple Telecommunication Example

The following example presents a discrete-time model of a telecommunication system similar to the one discussed in the introduction. The example illustrates the use of nondeterminism for the representation of approximate knowledge of transition probabilities. Consider a telecommunication network, in which there is a total number n > 0 of connections available, and there is a distinguished user u_1 that tries intermittently to place calls. We model this system by the MDP P = (S, Acts, A, p), where S = {idle, trying, connected} × {0,...,n} × {0,1}. In a state ⟨x,k,i⟩ ∈ S, x is the state of the user u_1, k is the number of busy connections, and i specifies whether it is u_1's turn (i = 0) or the network's turn (i = 1) to update the state. The actions are Acts = {a,b}, and we have A(s) = {a,b} for every s ∈ S. The number of busy connections performs a random walk between 0 and n. From state ⟨x,k,1⟩, under either action a or b, we update the state as follows:

• If 2 ≤ k < n, then we go with probability 1/2 to ⟨x,k−1,0⟩ and with probability 1/2 to ⟨x,k+1,0⟩.
• If k = 1 and x ≠ connected, then we go with probability 1/2 to ⟨x,k−1,0⟩ and with probability 1/2 to ⟨x,k+1,0⟩.
• If k = 1 and x = connected, then we go to ⟨x,k+1,0⟩.
• If k = 0, then we go to ⟨x,k+1,0⟩.
• If k = n, then we go to ⟨x,k−1,0⟩.

From state ⟨idle,k,0⟩, if the action a is chosen we proceed to state ⟨idle,k,1⟩; if action b is chosen, we proceed to state ⟨trying,k,0⟩. From state ⟨trying,k,0⟩, under both actions a and b we proceed to state ⟨connected,k+1,1⟩ if k < n, and to state ⟨idle,k,1⟩ if k = n. Finally, from state ⟨connected,k,0⟩, under both actions a and b we proceed to state ⟨idle,k−1,1⟩: for simplicity, we consider only unit-duration phone calls.

To measure the long-run average fraction of successful calls, we define the labels r and w as follows. We let w(⟨trying,k,0⟩, ξ) = 1 for all 0 ≤ k ≤ n and all ξ ∈ {a,b}, so that w counts the number of attempted calls. We let r(⟨trying,n,0⟩, ξ) = 0 and r(⟨trying,k,0⟩, ξ) = 1 for all 0 ≤ k < n and all ξ ∈ {a,b}, so that r counts the number of successful calls. It is easy to check that v_s^π is equal to the fraction of successful calls from the initial state s under policy π. The worst-case value of this fraction under perfect-information policies is v_Π^− = 0. This worst-case value arises when the user u_1 chooses action b whenever there are no free connections, and action a otherwise. This is clearly an unrealistic worst-case value. A better estimate can be obtained by introducing a partial-information relation ∼ defined by ⟨x,k_1,i⟩ ∼ ⟨x,k_2,i⟩ for all x ∈ {idle, trying, connected}, all 0 ≤ k_1, k_2 ≤ n, and all i ∈ {0,1}. This partial-information relation prevents the user u_1 from selecting actions a and b on the basis of the number of free connections.
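To make the construction concrete, here is a sketch of this MDP in the dictionary encoding used earlier. The sketch is our illustration, not part of the paper; we assume n ≥ 2, so that the cases k = 0, k = n, and k = 1 with x = connected do not overlap.

def telecom_mdp(n):
    # States are triples (x, k, i): x is the user's state, k the number of
    # busy connections, i whose turn it is (0 = user, 1 = network). n >= 2.
    trans = {}
    for x in ("idle", "trying", "connected"):
        for k in range(n + 1):
            # Network's turn: k performs the random walk between 0 and n.
            if k == 0:
                succ = {(x, 1, 0): 1.0}
            elif k == n:
                succ = {(x, n - 1, 0): 1.0}
            elif k == 1 and x == "connected":
                succ = {(x, 2, 0): 1.0}   # the user's own connection is busy
            else:
                succ = {(x, k - 1, 0): 0.5, (x, k + 1, 0): 0.5}
            trans[(x, k, 1)] = {"a": succ, "b": dict(succ)}
    for k in range(n + 1):
        # User's turn at idle: a = stay idle (pass the turn), b = place a call.
        trans[("idle", k, 0)] = {"a": {("idle", k, 1): 1.0},
                                 "b": {("trying", k, 0): 1.0}}
        # Trying: the call succeeds iff some connection is free (k < n).
        t = ("connected", k + 1, 1) if k < n else ("idle", k, 1)
        trans[("trying", k, 0)] = {"a": {t: 1.0}, "b": {t: 1.0}}
        if k >= 1:   # states (connected, 0, 0) are unreachable
            trans[("connected", k, 0)] = {"a": {("idle", k - 1, 1): 1.0},
                                          "b": {("idle", k - 1, 1): 1.0}}
    # Labels: w counts attempted calls, r counts successful ones.
    w = {(("trying", k, 0), z): 1 for k in range(n + 1) for z in "ab"}
    r = {(("trying", k, 0), z): int(k < n) for k in range(n + 1) for z in "ab"}
    return trans, r, w

With this encoding, confinement queries can be run directly with the confinement_possible sketch above. The partial-information relation identifies states differing only in the component k, so a memoryless partial-information policy is simply a map from the six observation classes (x, i) to distributions over {a, b}.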
The worst-case fraction of successful calls under partial-information policies, v_PIPol^−, provides a more realistic estimate of the performance of the system.

We note that introducing a partial-visibility relation is equivalent to assuming that there are no factors external to the model that can influence the policies differently at states related by the partial-visibility relation. While u_1 most likely cannot base his decision of calling on the number of free connections, there might be external factors that make it more likely for u_1 to call when more connections are busy. For example, in countries where soccer is popular, more people place telephone calls during the mid-game intervals than during the game proper. If such external factors are not accounted for, then the worst-case long-run fraction of successful calls computed under partial-information policies is an optimistic estimate of the true worst-case long-run average fraction. Hence, adding partial-information restrictions to the policies should be done on the basis of a careful examination of the model.

As an alternative to using partial information, we can increase the accuracy of the worst-case estimates of the fraction of successful calls by reducing the role of nondeterminism and providing more probabilistic information about the user's behavior. Specifically, suppose that we know that the probability that the user will place a call when idle is between 0.1 and 0.2. To represent this range, we modify the above model as follows. From state ⟨idle,k,0⟩, action a (resp. b) leads to ⟨idle,k,1⟩ with probability 0.9 (resp. 0.8), and to ⟨trying,k,0⟩ with probability 0.1 (resp. 0.2). In this model, v_Π^− may provide a realistic value for the fraction of successful calls, and the resolution of the remaining nondeterminism under perfect information can account for correlations of events not described by the model, such as the above-mentioned soccer-game phenomenon.

4 Complexity of Partial-Information Confinement

In this section, we present the complexity results for the confinement problem under partial-information policies. By reasoning as in [Rei84], it can be shown that the confinement problem is EXPTIME-complete for partial-information policies. Moreover, we show that the confinement problem is NP-complete for memoryless and limit-memoryless partial-information policies.

4.1 General Partial-Information Policies

The following theorem states our result for general partial-information policies.

Theorem 1  The confinement problem is EXPTIME-complete for the class of partial-information policies.

Proof. The fact that the problem is in EXPTIME follows from the subset construction of [Rei84]. The lower bound follows by reasoning as in [Rei84] for "blindfold games", repeating infinitely many times the simulation of the nondeterministic Turing machine.

Corollary 1  The problems of pCTL model checking and of the computation of the minimum long-run average outcome are EXPTIME-hard for partial-information policies.

Proof. The result about pCTL model checking follows directly from Theorem 1 by considering a property of the form □U, where by abuse of notation we denote by U both a subset of states and a predicate defining that subset. The result about the minimum long-run average outcome follows from Theorem 1 by considering an MDP in which the set U, once left, cannot be re-entered, together with a function w identically equal to 1, and a function r defined by r(s,a) = 1 if s ∈ U and r(s,a) = 0 if s ∉ U, for all states s and actions a.
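The labelings used in the second part of this proof are mechanical to produce; a small sketch in the encoding used above (ours, for illustration) is:

def confinement_labels(trans, U):
    # w == 1 everywhere makes every policy proper; r is the indicator of U,
    # so the long-run average outcome measures the long-run fraction of time
    # spent in U (assuming, as in the proof, that U cannot be re-entered).
    w = {(t, a): 1 for t in trans for a in trans[t]}
    r = {(t, a): int(t in U) for t in trans for a in trans[t]}
    return r, w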
4.2 Memoryless and Limit-Memoryless Partial-Information Policies

A memoryless partial-information policy is a policy that is both memoryless and partial-information. We denote by Π_MP the class of memoryless partial-information policies. To define limit-memoryless partial-information policies, for all s ∈ S and a ∈ A(s) we denote by

    N_{s,a}^n = Σ_{k=0}^{n−1} δ(X_k = s ∧ Y_k = a)

the random variable indicating the number of times that the state-action pair s,a appears in the first n steps of a path. A frequency-stable policy is a policy π such that the limit

    x_t^π(s,a) = lim_{n→∞} (1/n) E_t^π{ N_{s,a}^n }    (2)

exists for all t,s ∈ S and a ∈ A(s). The quantity x_t^π(s,a) is the frequency of the state-action pair s,a from the initial state t under policy π. A policy π is limit-memoryless partial-information if it is frequency-stable, and if for all states s,t,u ∈ S with t ∼ u, one of the following conditions holds:

1. either x_s^π(t,a) = 0 for all a ∈ A(t);
2. or x_s^π(u,a) = 0 for all a ∈ A(u);
3. or, for all a ∈ A(t),

    x_s^π(t,a) / Σ_{b∈A(t)} x_s^π(t,b) = x_s^π(u,a) / Σ_{b∈A(u)} x_s^π(u,b).    (3)

Together, these conditions state that each action is chosen with the same relative frequency at t and at u, unless one of t or u has frequency 0 for all the actions. We denote by Π_LMP the class of limit-memoryless partial-information policies. We note that a memoryless partial-information policy is also a limit-memoryless partial-information policy. Intuitively, a limit-memoryless partial-information policy is a policy that can initially behave in an arbitrary way, but that in the long run gives rise to state-action frequencies that correspond to those of a memoryless partial-information policy.
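To make conditions 1–3 concrete, the following sketch (ours) checks them for a table of frequencies x_s^π(·,·) computed or estimated beforehand; the dictionary encoding, the representation of ∼ by its list of equivalence classes, and the numerical tolerance are assumptions for illustration.

def respects_limit_memoryless(x, classes, A, tol=1e-9):
    # x[(t, a)] is the frequency of the state-action pair (t, a) from a fixed
    # initial state; classes lists the equivalence classes of states under ~;
    # A[t] is the set of actions available at t (t ~ u implies A[t] == A[u]).
    for cls in classes:
        for t in cls:
            for u in cls:
                ft = sum(x[(t, b)] for b in A[t])
                fu = sum(x[(u, b)] for b in A[u])
                if ft == 0 or fu == 0:
                    continue   # condition 1 or 2 holds for the pair (t, u)
                for a in A[t]:
                    # condition 3: equal relative frequencies at t and at u
                    if abs(x[(t, a)] / ft - x[(u, a)] / fu) > tol:
                        return False
    return True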
Theorem 2  The confinement problem for the classes of memoryless and limit-memoryless partial-information policies is NP-complete.

Proof. To see that the problems are in NP, note that it suffices to guess a deterministic memoryless partial-information policy, and check that it satisfies (1); fixing such a policy reduces the MDP to a Markov chain, on which (1) can be checked in polynomial time. The proof of NP-hardness is by a reduction from the SAT problem. Consider an instance of the SAT problem defined over a finite set Y = {y_1, ..., y_k} of variables. Let L = Y ∪ {ȳ | y ∈ Y} be the set of literals, and let c_1, ..., c_n ⊆ L be the clauses composing the problem. The SAT problem consists in checking whether the propositional formula ∧_{i=1}^n ∨_{l∈c_i} l is satisfiable. From this instance of the SAT problem, we construct an MDP P = (S, Acts, A, p) and a partial-information relation ∼ as follows. The state space is

    S = {s_0, s_1} ∪ {1, ..., k+1} × {1, ..., n} × {0,1}.

We let U = S \ {s_0}, and we take s_1 as the initial state. A state of the form ⟨m,i,j⟩ refers to the occurrence of variable y_m in clause c_i (where y_{k+1} is a dummy variable). The component j of the state keeps track of whether clause c_i has already been satisfied by the variable assignment (j = 1) or not (j = 0). The set of actions is Acts = {a,b}. We have A(⟨m,i,j⟩) = {a,b} for all 1 ≤ m ≤ k+1, all 1 ≤ i ≤ n, and all j ∈ {0,1}. Choosing action a (resp. b) at state ⟨m,i,j⟩ corresponds to choosing the truth value true (resp. false) for y_m in clause c_i. We let A(s_0) = A(s_1) = {a}. The transitions are as follows. We let p(s_0,a)(s_0) = 1, so that s_0 is absorbing, and we let p(s_1,a)(⟨1,i,0⟩) = 1/n for 1 ≤ i ≤ n. For all 1 ≤ i ≤ n and ξ ∈ {a,b}, we let

    p(⟨k+1,i,1⟩, ξ)(s_1) = 1,    p(⟨k+1,i,0⟩, ξ)(s_0) = 1,

so that if clause c_i has been satisfied we go back to s_1, and we proceed to s_0 otherwise. For 1 ≤ m ≤ k, 1 ≤ i ≤ n, and j ∈ {0,1}, the transitions from the other states are defined as follows (the whole construction is sketched in code after this list):

• From ⟨m,i,1⟩, both a and b lead deterministically to ⟨m+1,i,1⟩.
• From ⟨m,i,0⟩, we have three cases:
  – If y_m ∈ c_i, then a leads deterministically to ⟨m+1,i,1⟩ and b to ⟨m+1,i,0⟩.
  – If ȳ_m ∈ c_i, then a leads deterministically to ⟨m+1,i,0⟩ and b to ⟨m+1,i,1⟩.
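The construction can be written out as follows. This sketch is ours: the handling of a variable that occurs in the clause with neither polarity (both actions preserving j), and the partial-information relation ⟨m,i,j⟩ ∼ ⟨m,i′,j′⟩ (so that a policy observes only the variable index m, and a memoryless partial-information policy must choose one truth value per variable, uniformly across clauses), are natural assumptions here rather than details taken from the text above.

def sat_mdp(k, clauses):
    # clauses: list of sets of literals, a literal being (m, positive) with
    # 1 <= m <= k. Returns the transition structure and the safe set U.
    n = len(clauses)
    trans = {"s0": {"a": {"s0": 1.0}},   # s0 is absorbing
             "s1": {"a": {(1, i, 0): 1.0 / n for i in range(1, n + 1)}}}
    for i, clause in enumerate(clauses, start=1):
        for m in range(1, k + 1):
            # j = 1: clause c_i is already satisfied, and j is preserved.
            trans[(m, i, 1)] = {z: {(m + 1, i, 1): 1.0} for z in "ab"}
            # j = 0: action a sets y_m to true, action b sets it to false;
            # if y_m does not occur in c_i, j stays 0 (assumed case).
            ja = 1 if (m, True) in clause else 0
            jb = 1 if (m, False) in clause else 0
            trans[(m, i, 0)] = {"a": {(m + 1, i, ja): 1.0},
                                "b": {(m + 1, i, jb): 1.0}}
        # After the dummy variable y_{k+1}: back to s1 iff c_i was satisfied.
        trans[(k + 1, i, 1)] = {z: {"s1": 1.0} for z in "ab"}
        trans[(k + 1, i, 0)] = {z: {"s0": 1.0} for z in "ab"}
    U = set(trans) - {"s0"}
    return trans, U

The intended correspondence is that a satisfying truth assignment yields a deterministic memoryless partial-information policy under which the system returns to s_1 forever, so confinement in U holds with probability 1; if the formula is unsatisfiable, every such policy leaves some clause unsatisfied, and from s_1 the system falls into s_0 with probability at least 1/n in each round, so confinement fails.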