Solving POMDPs by Searching the Space of Finite Policies
Nicolas Meuleau, Kee-Eung Kim, Leslie Pack Kaelbling and Anthony R. Cassandra
Computer Science Dept, Box 1910, Brown University, Providence, RI 02912-1210 Abstract
storeable in a finite machine). Knowing that any policyis representable as a (possibly infinite) state automaton,the first constraint we would want to impose on the pol- Solving partially observable Markov decision icy is to be representable by a finite state automaton, or, processes (POMDPs) is highly intractable in gen- as we will call it, a finite “policy graph”. Many previous eral, at least in part because the optimal policy approaches implicitly rely on a similar hypothesis: Some may be infinitely large. In this paper, we ex- authors [14, 12, 2, 27] search for optimal reactive (or mem- plore the problem of finding the optimal policy oryless) policies, McCallum [15, 16] searches the space from a restricted set of policies, represented as of policies using a finite-horizon memory, Wiering and finite state automata of a given size. This prob- Schmidhuber’s HQL [26] learns finite sequences of reactive lem is also intractable, but we show that the com- policies, and Peshkin et al. [19] look for optimal finite- plexity can be greatly reduced when the POMDP external-memory policies. All these examples are particu- and/or policy are further constrained. We demon- lar cases of finite policy graphs, with a set of extra structural strate good empirical results with a branch-and- constraints in each case (i.e., not every node-transition and bound method for finding globally optimal deter- action choice is possible in the graph). Note that in general, ministic policies, and a gradient-ascent method finite policy graphs can remember events arbitrarily far in for finding locally optimal stochastic policies.
the past. They are just limited in the number of events theycan memorize.
In this paper we study the problem of finding the best policyrepresentable as a finite policy graph of a given size, pos-sibly with simple constraints on the structure of the graph.
In many application domains, the partially observable The idea of searching explicitly for an optimal finite pol- Markov decision process (POMDP) [1, 22, 23, 5, 9, 4, 13] is icy graph for a given POMDP is not new. In the early 70s, a much more realistic model than its completely observable Satia and Lave [21] proposed a heuristic approach for find- counterpart, the classic MDP [11, 20]. However, the com- ing -optimal decision trees. Hansen [6, 7, 8] proposed a plexity resulting from the lack of observability limits the policy iteration algorithm that outputs an -optimal con- application of POMDPs to dramatically small decision prob- troller. These solution techniques work explicitly in the lems. One of the difficulties of the optimal—Bayesian— belief space used in the classical—Bayesian—optimal so- solution technique is that the policy it produces may use lution of POMDPs, and they output policy-graphs which are the complete previous history of the system to determine from this optimal solution. Another ap- the next action to perform. Therefore, the optimal policy proach uses EM to find controllers that are optimal over a may be infinite and we have to approximate it at some level to be able to implement it in a finite machine. Anotherproblem is that the calculation requires reformulating the A characteristic property of our algorithms is that they scale problem in the continuous space of belief functions, and up well with respect to the size of the problem. Their draw- hence it is much harder than the simple finite computation back is that their execution time increases quickly with the that is sufficient to optimize completely observable MDPs.
size of the policy graph, i.e., with the complexity of thepolicy we are looking for. In general, they will be adapted What can we do if we have to solve a huge POMDP? Since to large POMDPs where relatively simple policies perform it may be impossible just to represent the optimal pol- near optimally. Another characteristic of our approach is icy in memory, it makes sense to restrict our search to that we do not refer to the value of the optimal Bayesian policies that are reasonable in some way (calculable and solution anymore, we just want the best graph given the empirical evidence that our approach allows the solution of constraint imposed on the number of nodes. Note that the some POMDPs whose size is far beyond the limits of clas- optimality criterion used is the same as in the Bayesian ap- proach, i.e., the expected discounted cumulative rewards(the expectation being relative to the prior belief on the POMDPS AND FINITE POLICY
states). However, since we do not evaluate the solution produced relative to the optimal performance, the problemmay be solved without using the notion of belief-space.
Our development relies on a basic property of finite-statecontrollers that has already been stressed by Hansen [6], A partially observable Markov decision process (POMDP) and that is also very close to Parr and Russell’s HAM theo-  ✂✁☎✄✝✆✞✄✠✟✡✄✠☛☞✄✍✌✎✄✝✏✒✑ rem [18]. Namely, given a POMDP and a finite policy graph,the sequence of pairs (state of the POMDP, node of the pol- icy graph) constitutes a Markov chain. Going farther, we define a new MDP on the cross product of the state-space and the set of nodes, where a decision is the choice of an action and of a next node (conditioned on the last obser- vation). Working in this cross-product MDP presents many ☛✔ ✂✕✖✄✍✗✘✑✚✙✜✛✣✢✖ ✤✗✖✥✣✙✦✗✔✧✖✕★✥✩✙✜✕✪✑ advantages: it allows us to calculate and then differentiate the value of a fixed policy, to calculate upper and lower ✌✬ ✂✕✖✄✍✭✮✄✝✕★✯✤✑✩✙✜✛✣✢✖ ✰✕★✥✰✱☎✲✳✙✜✕✴✯✚✧✖✕★✥✩✙✜✕✖✄✍✭✵✥✩✙✶✭✷✑ bounds on the value attainable by completing a given par-tial policy, and also to establish some complexity results.
✥✳✙✺✏✻ ✰✕✼✄✠✭✮✄✴✕✴✯✤✑ We use these properties to develop implementations of two classic search procedures (one global and one local) wherethe majority of the computation consists of solving some Bellman equations in the cross-product MDP. An important  ✰✁☎✄✍✟✬✄✍✌✎✄✝✏✒✑ is optimized is the following way [11, 20]: point is that the structure of both the POMDP and the pol- icy graph (if there is one) is reflected in the cross-product MDP. It can be used to accelerate the solution of Bellmanequations [3], and hence the execution of the solution tech- niques. In other words, the algorithms we propose can ex- ploit the structure of the POMDP to find relatively quickly the best general finite policy graph of a given size. If this leverage is not sufficient, we may limit further the search space by imposing some structure on the policy graph, and to perform in each possible state. The optimal expected then using this structure to speed up the solution of the discounted reward, or “value function”, is defined as the cross-product MDP (for instance, we can limit ourselves to unique solution of the set of Bellman equations: one of the finite-memory architectures mentioned above).
The paper is organized as follows. First we give a quick ✑❜ ❝✏✻ ✰✕✼✄✠✭✮✄✴✕ introduction to POMDPs and policy graphs, and define the deterministic finite policy graph is an NP-hard problem.
for all . It is a remarkable property of MDPs that there There is then no really easy way to solve our problem. In exists an optimal policy that always executes the same ac- this paper, we propose two possible approaches: a global tion in the same state. Unfortunately, this policy cannot be branch and bound search for finding the best determinis- used in the partially observable framework, because of the tic policy graph, and a local gradient descent search for residual uncertainty on the current state of the process.
finding the best stochastic policy graph. These two algo-rithms are based on solving some Bellman equations in the In the POMDP framework, a policy is in general a rule spec- cross-product MDP. Therefore, they can take full advan- ifying the action to perform at each time step as a function tage of any preexisting structure in the POMDP or in the of the whole previous history, i.e., the complete sequence of policy graph. Typically, these algorithms will be adapted observation-action pairs since time 0. A particular kind of to very structured POMDPs with a large number of states, policy, the so-called reactive policies (RPs), condition the a small number of observations, and such that some sim- choice of the next action only on the previous observation.
ple policies perform well. In the end of the paper, we give Thus, they can be represented as mappings and the last observation. We will use the following nota-tion: Figure 1: Structure of the policy graphs representing reac- is the probability of moving from node ✆ (reactive or not) realizes an expected cumu- is the probability distribution of the initial node ✆ The classical—Bayesian—approach allows us to determinethe policy that maximizes this value. It is based on updating In some cases, we will want to impose extra constraints on the state distribution (or belief) at each time step, depend- the policy graph. In most of this paper, we will limit our- ing on the most recent observations [5, 9, 4, 13]. The prob- selves to “restriction constraints” which consist in restrict- lem is re-formulated as a new MDP using belief-states in- ing the set of possible actions executable in some nodes, stead of the original states. Generally, the optimal solution and/or restricting the set of possible successors of some is not a reactive policy. It is a sophisticated behavior, with nodes under some observations. Note that forcing the graph optimal balance between exploration and exploitation. Un- to implement an RP represents a set of restriction constraint fortunately, the Bayesian calculation is highly intractable as defined here. We consider more sophisticated sets of as it searches into the continuous space of beliefs and con- siders every possible sequence of observations.
One advantage of representing the policy as a policy graph A policy graph for a given POMDP is a graph where the is that the cross-product of the POMDP and the policy graph is itself a finite MDP. Another interesting point is that all the structure of both the POMDP and the policy graph (if there arc emanating from each node for each possible observa- is some) is represented in this cross-product MDP. It will tion. When the system is in a certain node, it executes the allow us to develop relatively fast implementations of some action associated with this node. This implies a state transi- classical techniques to solve our problem.
tion in the POMDP and eventually a new observation (whichdepends on the arrival state of the underlying MDP). This Calculating the value of a policy graph
observation itself conditions a transition in the policy graph theorem has been used by Hansen [6, 8, 7], and his closely to the destination node of the arc associated with the new related to Parr and Russell’s HAM theorem [18].
observation. Every policy has a representation as a possi-bly (countably) infinite policy graph. A policy that chooses Theorem 1 Given a policy graph
a different action for each possible previous history will be  ✂✁☎✄✝✆✞✄✠✟✡✄✠☛☞✄✍✌✎✄✝✏✒✑ represented by an infinite tree with a branch for each possi- generated constitutes a Markov chain. ble history. Reactive policies correspond to a special kindof finite policy graph with as many nodes as there are ob- The influence diagram of figure 2 proves this property: ✥✂✱☎✲✪✄✴✕✴✥✂✱☎✲✠✑ the same observation go to the same node (figure 1).
In a stochastic policy graph there is a probability distribu- tion over actions attached to each node instead of a single ✄✠✭✷✑❝✌✬ ✂✕✖✄✍✭✮✄✝✕ action, and transitions from one node to another are ran- dom, the arrival node depending only on the starting node where the maximization is applied row by row. When weexpand this equation, the maximization over Figure 2: Influence diagram proving the Markov property ✑✭✬✪✏✻ ✰✕✼✄✠✭✮✄✴✕ of the cross-product MDP. Dotted arrows represent depen- dencies that we did not take into account in this work, but that are sometimes represented in other formulations. As shown, theorem 1 is still valid in this more general frame- The expected optimal reward, independent of the starting In the same way, we can calculate the expected immediate . Note that this optimal policy is generally not imple- ✑❴✏✻ ✂✕✖✄✍✭✮✄✝✕ mentable in the policy graph, since it may associate two different actions with the same node, depending on the state with which the node is coupled. In other words, we need to know the current state to use this policy. The agent using the fundamental equation (in matrix form): a policy graph is basically embedded in the cross-product MDP, but it has only partial observability of its product state fundamental equation (7) is useful in some algorithms, be- cause it represents an upper-bound of the performance at- Finally, the value of the policy, independent of the starting tainable by any implementable policy. We will use this in a branch-and-bound algorithm for finding the optimal deter-ministic policy graph. Note that the addition of restriction constraints on the policy, as defined in section 2.2, does not invalidate theorem 2. It just limits the set of possible ac- tions in some states of the cross-product MDP, and then it reduces the complexity of its solution. As a consequence, the branch-and-bound algorithm will also be able to find the best graph under some restriction constraints.
Differentiating this value with respect to the parameters ofthe graph will enable us to climb its gradient.
Computational leverage.
that most of the computation performed by the algorithms Solving the cross-product MDP.
that will follow consists of solving a Bellman fundamen- tal equation with a fixed policy as in (4), or for the sake of an action , wait for the new observation, and then chose finding the optimal deterministic policy as in (7). This can the next node. It is equivalent to choosing an action and be done by successive approximations, the algorithm being called “value iteration” in the case of (7). The complexity as a function of the next observation. Therefore, the action (times the number of iterations, which can be Theorem 2 The tuple
important point is that any structure in the transition matrix can be exploited while executing these back-ups. The ✄✝✕❊✑❴✑✣✙✶✌✬ ✰✕✼✄✠✭✮✄✴✕ the structure of the POMDP: A sparse transition matrix of the POMDP provides leverage that allows the speed-up of successive-approximation iterations [3].
✯✂✄✴✕✴✯❵✑❴✑✚✙✜✏✻ ✰✕✼✄✠✭✮✄✴✕✴✯❵✑ is the branching factor of the POMDP (i.e., the average number of possible successors of a state) It is a branch-and-bound algorithm; i.e., it systematically enumerates the set of all possible solutions using bounds For instance, deterministic transitions reduce the com- on the quality of partial solutions to exclude entire regions of the search space. If the lower bound of one partial policy observation matrix is exploitable. For instance, deter- is greater than the upper bounds of others, then it is useless ministic observations reduce the complexity by a fac- explore these partial policies. Otherwise, each possible ex- tension of them will considered in time. Therefore, the al-gorithm is guaranteed to find the optimal solution in finite the structure of the policy graph: If the leverage gained time. Note that this approach is a generalization of a pre- from the structure of the POMDP is not sufficient, vious algorithm proposed by Littman [14]. His algorithm then one can choose to restrict further the search is limited to policy graphs representing RPs and to POMDPs space by imposing structural constraints on the graph, with a very particular structure: state-transitions and obser- and using this structure to speed up the calculation.
vations are deterministic, and the problem is an achieve- An extreme, but often adopted, solution is to look ment task (i.e., there is a given goal state that must be reached as soon as possible). The formalism proposed here handles any kind of POMDP and any kind of policy graph with restriction constraints. However, Littman gives more more about constraining the policy graph in section 4.
details on some aspects of the algorithm, and the reader canrefer to his paper to complete the brief description that we When both the problem and the policy are structured, the leverage gained can be bigger than just the addition of theeffect of the two structures. For instance, evaluating a Ordering of free parameters.
RP in a completely deterministic problem can be done in policies is expanded (in depth-first, breadth-first, or in a best-first way) by picking the free parameters of the pol-icy one after the other, and considering all possible assign- FINDING THE OPTIMAL POLICY
ment values for each of them. The game of pruning some branches based on upper bound/lower bounds comparisonis added onto that. The size of the tree that is actually ex- In this paper, we consider the problem of finding the best panded in this process strongly depends on the order in policy graph of a given size for a POMDP. Littman [14] which the free parameters are picked. In our case, it is showed that finding the best RP for a given POMDP is an NP-hard problem. First, we generalize this result to any values of ✔ . In other words, when building a solution, we finite policy graph with a given number of nodes and any assign actions to the nodes first, and then we fix the struc- ture of the graph. Otherwise, no pruning is possible beforeall possible structures have been expanded (this is due to Theorem 3 Given a POMDP and a set of restriction con-
the nature of the upper-bound that we use, see below). In straints, the problem of finding the optimal deterministic policy graph satisfying the constraints is NP-hard. but before ✔ . There is also an issue with the symmetry ofthe policy-graph space. For instance, in the absence of re- The proof is straightforward: Finding the best deterministic striction constraints, we can permute the role played by the policy graph is equivalent to finding the best deterministic different nodes without changing the policy. Each policy RP of the cross-product POMDP. Then the result follows avoid enumerating equivalent graphs by imposing some ar-bitrary rule when expanding the tree. For instance we can The techniques for solving NP-hard problems may be clas- impose that the index of the action attached to a node al- sified into three groups: global search, local search and ways be greater or equal to the index of the action of node approximation algorithms. In this paper, we will use two , for all . This simple trick can improve greatly the classic techniques, a global search (section 3.1) and a local performance of the heuristic search, merely dividing the search (section 3.2). We will consider a possible approxi- GLOBAL SEARCH
Upper bounds.
A partial solution is a general finite pol- icy graph with more restriction constraints than initially A heuristically guided search is used to find the best deter- (each time we specify an action or a node-transition, we ministic policy graph of a given size, whatever the restric- add a constraint). Then we can get an upper bound by solv- ing the cross-product MDP, as explained in the second part LOCAL SEARCH
specified policy graph corresponds to an RP of the cross-product (PO)MDP, so no policy graph can do better than In this approach, we try to find the best stochastic policy the optimal solution of the cross-product MDP. Note that graph by treating this problem as a classical non-linear nu- this upper bound has a nice monotonicity property: it does merical optimization problem. Since the value of a policy not increase when we fix a free parameter, and it is equal to graph is continuous and differentiable with respect to the the true value of the policy graph when the graph is com- policy parameters, we can calculate its gradient and climb pletely specified. On the other hand, as long as no value it in many different ways. We will not develop all the pos- sibilities for climbing the gradient here, but we will rather is specified, the optimal policy found by solving the cross-product MDP is equivalent to the optimal policy of the focus on the calculation of the gradient, and then just de-  ✰✁☎✄✍✟✬✄✍✌✎✄✝✏✒✑ pict a simple implementation of gradient ascent. Note that since the gradient may be calculated exactly, this approach is independent of ✆ and depends only on . Hence, the calculated upper bound is always equal to the value of is guaranteed to converge to a local optimum. The topology the optimal policy of the cross-product MDP. This is why of the search space, and hence the number of local optima, no pruning can be done as long as no value of depends on two things: the structure of the specified and this parameter must be the considered first and the constraints imposed on the policy. By introducing constraints on the policy, we can hope not only to reducethe execution time of the algorithm, but also to change the“landscape” for a less multimodal one.
Lower bounds.
order, then we can use the values of the complete poli- Calculating the gradient.
cies already expanded to determine lower bounds on the is given by equation (6). For each policy parameter best performance attainable by extending each partial pol- icy. Otherwise, we can find a lower bound for a given par- tial policy by completing it at random and calculating thevalue of the resulting complete policy. An improvement consists of performing a simple local optimization after having completed the policy [14]. In our implementation, we also used a heuristic technique based on the solution of We are interested in the gradient with respect to the pol- the cross-product MDP to complete the partial policy. We icy parameters, i.e., we will consider   calculate the performance of a complete policy by solving equation (4) in the cross-product MDP.
respect to these variables can be calculated easily startingfrom (2) and (3). The main difficulty in the calculation of Complexity.
bounds of each node of the expanded tree requires solving some Bellman equations in the cross-product MDP. Hence approximation, the basic update rule being: ber of iterations of successive approximation executed dur- ing this calculation, one can store, with each partial policy, , the complexity of a complete back up is the value function found when calculating its upper bound.
Then we can start the computation of the upper bound of inverse can be used to calculate the gradient with respect to a child partial policy starting from the value of its parent.
any parameter   . A minor acceleration can be obtained by Since they are often not very different, we can gain a lot of time with this trick. However, the memory space needed in- the iterative computation of this value at the new point. It creases dramatically. Even if we can calculate the bounds can reduce the number of iterations of (8) at each step, but relatively quickly, the real problem is how many nodes it it is still a matrix-wise DP with a complexity in will be necessary to expand before reaching the optimum.
In the worst case, the complete tree of all possible solu-tions will be expanded, which represents a complexity ex- There is another way of computing the gradient with a com- ponential in the number of degrees of freedom of the policy graph. In practice, our simulations showed that many fewer nodes are actually expanded. Note that adding simple con- several (classical) vector-wise DPs for which complexity straints on the policy reduces not only the complexity of , or less if there is some structure in ✄ the search space and hence the number of nodes expanded.
Figure 3: The load/unload problem with 8 locations: the agent starts in the “Unload” location (U) and receives a re- ward each time it returns to this place after passing through the “Load” location (L). The problem is partially observ- able because the agent cannot distinguish the different lo- which is also a vector-wise, square-complexity DP. The cations in between Load and Unload, and because it can- tion must be re-done for each policy parameter and ❙ ✳ depend on   . Thus, we have divided the complex- This Monte-Carlo approach works only if there are strong structural constraints on the graph, and thus cannot be ap- free variables of the graph. However, this approach will be plied for finding general finite policy graphs. Note also that useful in most cases, since there are often many fewer free Littman reported observing a great superiority (in terms of variables in the policy graph than “cross-product states”.
execution time) of the global branch-and-bound search over For instance, if we are looking for the best reactive policy, the Monte-Carlo approach, in the case where the graph is then the indirect calculation allows us to gain a factor of constrained to encode a simple RP. Our simulations with other architectures (sets of structural constraints) showedsimilar results: in general, the Monte-Carlo approach can- Climbing the gradient.
the problem is somewhat more complicated since all the INTRODUCING STRUCTURAL
parameters that we optimize are probabilities and we have CONSTRAINTS
to ensure that they stay valid (i.e., inside of the simplex)after each update. There are numerous ways for doing that, Because the majority of their computation is to perform including renormalizing and using the soft-max function.
Bellman back-ups in the cross-product MDP, the algorithms In our implementation, we chose to project the calculated outlined above can take advantage of any preexisting struc- gradient on the simplex, and then apply it until we reach an ture in the POMDP. However, this leverage can be insuffi- edge of the simplex. If we reach an edge and the gradient cient if the problem is too big or too difficult for the two points outside of the simplex, then we project the gradient techniques. In this case, one may whish to restrict further the search space by imposing structural constraints on thepolicy graph. For instance, a simple solution consists of Related work.
The idea of using a gradient algorithm for defining a neighborhood for the nodes of the graph, and solving POMDPs has already been pursued by several au- allowing transitions only to a neighboring node. This cor- thors [2, 12, 27]. The main difference between this work responds to a set of restriction constraints (✔ is forced to and ours is that these authors use a Monte-Carlo estima- take the value zero in many points), and hence the algo- tion of the gradient instead of an exact calculation, and that rithm above can still be applied. A somehow extreme so- they limit themselves to RPs, which is much less general lution consists of limiting the search to reactive policies than our approach. Moreover, Jaakkola et al. do not use the (then ✔ is completely fixed in advance). More complex sets exponentially discounted criterion (1), but the average re- of constraints can also be used, for instance, we can limit ward per time step. In a companion paper [17], we propose the search to policy representable as a finite sequence of a stochastic gradient descent approach for learning finite RPs (with particular rules governing the transition from one policy graph during a trial-based interaction with the pro- RP to another), as in Wiering and Schmidhuber’s HQL [26].
Other instances include the finite-horizon memory policiesused by McCallum [15, 16], or the external-memory poli- OTHER APPROACHES
cies used by Peshkin et al. [19]. Although these architec-tures cannot be described only in terms of restriction con- A Monte-Carlo approach based on Watkins’Q-learning straints (there are also equality constraints between differ- [25, 24] is also applicable to our problem. For instance, we ent parameters of the graph), the previous results and al- can an use Q-learning based on observation-action pairs to gorithm can be extended to each of them in particular. In find (with no guarantee of convergence) the optimal RP for other words, we can use the previous algorithm to find a POMDP [14]. Another instance is Wiering and Schmid-huber’s HQL [26], which learns finite sequences of RPs.
Figure 5: A partially observable stochastic maze: the agent must go from the starting state marked with an “S”to the goal marked with an “G”. The problem is partially observ- able because the agent cannot perceive its true location, butonly its orientation and the presence or the absence of a Figure 4: Simulations results obtained with the load/unload wall on each side of the square defining its current state.
problem: execution time of the algorithms as a function of The problem is stochastic because there is a non-zero prob- the size of the problem (ga: gradient ascent, dfh: depth first ability of slipping, so that the agent does not always know heuristic search, bfh: breadth first heuristic search).
if its last attempt to make a move had any consequence onits actual position in the maze.
the best policy using a given finite-horizon memory, the best policy using an external-memory of a given easy POMDP, the results obtained represent a kind of upper bound on the performance of the algorithms. It is unlikelythat they will perform better on another (harder) problem.
(we can also show that it is NP-hard to solve these prob- 2 and the gradient algorithm always started from the centerof the simplex (i.e., the policy graph is initialized with uni- What do we gain and what do we lose when we impose form distributions).1 We measured the time of execution of structural constraints reduces the number of parameters per ascent, we stopped when we reached 99% of the optimal.
node (which should help both techniques), and modifies the When the heuristic search uses a stochastic calculation of topology of the search space (which influences the gradi- upper bounds, we average the measure over 50 runs.
ent descent approach). Another point is that the best graph set to 0.996 (a big value is necessary to accommodate big without the constraints can be better than the best graph state-spaces), and the learning rate of gradient descent was with the constraints, i.e., the constraints can decrease the optimized. The results are given in figure 4. They show value of the best solution. Even if this does not happen, that the heuristic search clearly outperforms the gradient more nodes may be required to reach the best performance algorithm, which becomes numerically unstable when the with the constraints than without. Consider, for instance, number of states increases in this kind of geometrically- the load/unload problem represented in figure 3. This sim- ple problem is solved optimally with a two-node policygraph, or with a sequence of two RPs as used in HQL. As an In the second set of experiments, we wanted to measure how far our algorithms can go in terms of problem size, in a problem more difficult than the simple load/unload.
the number of parameters per node will be smaller than in We used the a set of partially observable mazes with the the unconstrained case. In general, adding structure will be regular structure represented in figure 5, and whose size interesting if we choose an architecture that fits the prob- lem at hand. Hence, it is a question of previous knowledge mazes are not particularly easy, since they have only two about the problem at hand and its optimal solution.
different optimal paths. The minimal number of nodes forsolving them is 4, one per action (although the policy is not SIMULATION RESULTS
reactive). The time required for the (depth first) branch-and-bound algorithm to find the optimal solution with thisoptimal number of nodes is shown in figure 6. We see that In our first experiments, we used the simple load/unloadproblem of figure 3 with an increasing number of loca- 1We used a simpler version of the algorithms where the start- tions, to see how both algorithms scale up to increasing ing node is fixed. Otherwise, the policy using only uniform dis- problem-size, and how they compare. Since it is a very tributions is a (very unstable) local optimum.
tional leverage. Journal of AI Research, To appear, [4] A.R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes.
[5] A.R. Cassandra, L.P. Kaelbling, and M.L. Littman.
Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Con- ference on Artificial Intelligence, 1994.
[6] E.A. Hansen. An improved policy iteration algorithm for partially observable MDPs. In Advances in Neu-ral Information Processing Systems, 10. MIT Press, Figure 6: Performance of the branch-and-bound algorithm on the maze problem: execution time as a function of thenumber of states.
Finite-Memory Control of Partially Computer Science, University of Massachusetts at we can solve a partially observable maze with almost 1000 states in a less than 6 hours. It represents a performancefar above the capacities of classic approaches for solving [8] E.A. Hansen. Solving POMDPs by searching in pol- POMDPs. Note also that, as the number of states grows, icy space. In Proceedings of the Eighth Conference on the measured complexity is almost linear in the number of Uncertainty in Artificial Intelligence, pages 211–219, CONCLUSION
[9] M. Hauskrecht. Planning and Control in Stochas- tic Domains with Imperfect Information. PhD thesis,MIT, Cambridge, MA, 1997.
We studied the problem of finding the optimal policy rep-resentable as a finite state automaton of a given size, pos- [10] O. Higelin. Optimal Control of Complex Structured sibly with some simple structural constraints. This ap- Processes. PhD thesis, University of Caen, France, proach by-passes the continuous and intractable belief-state space. However, we showed that we end up with a NP-hard problem anyway. Then we proposed to use two classic [11] R.A. Howard. Dynamic Programming and Markov search techniques, and developed efficient implementations Processes. MIT Press, Cambridge, 1960.
of them that allow using the structure of the problem to ac-celerate the computation. If this is not sufficient, bigger [12] T. Jaakkola, S. Singh, and M.R. Jordan.
leverage can be gained by imposing structure on the policy.
forcement learning algorithm for partially observable However, our algorithms are limited by necessity to enu- Markov problems. In Advances in Neural Informa- merate at least once per iteration, the complete state space tion Processing Systems, 7. MIT Press, Cambridge, of the POMDP. In a companion paper [17], we propose an indirect learning algorithm that avoids this bottleneck.
[13] L.P. Kaelbling, M.L. Littman, and A.R. Cassandra.
Planning and acting in partially observable stochastic References
domains. Artificial Intelligence, 101, 1998.
[1] K.J. Astrom. Optimal control of Markov decision pro- [14] M.L. Littman. Memoryless policies: Theoretical lim- cesses with incomplete state estimation. J. Math. Anl. Animats 3: Proceedings of the Third International [2] L.C. Baird and A.W. Moore. Gradient descent for Conference on Simulation of Adaptive Behavior. MIT general reinforcement learning. In Advances in Neu- ral Information Processing Systems, 12. MIT Press,Cambridge, MA, 1999.
[15] R.A. McCallum. Overcoming incomplete perception with utile distinction memory. In The Proceedings [3] C. Boutillier, T.L. Dean, and S. Hanks. Decision the- of the Tenth International Machine Learning Confer- oretic planning: structural assumptions and computa- [16] R.A. McCallum. Reinforcement Learning with Selec- tive Perception and Hidden State. PhD thesis, Univer-sity of Rochester, Rochester, NY, 1995.
[17] N. Meuleau, L. Peshkin, K.E. Kim, and L.P. Kael- bling. Learning finite-state controllers for partiallyobservable environments.
teenth Conference on Uncertainty in Artificial Intel-ligence, To appear, 1999.
[18] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural In-formation Processing Systems 11. MIT Press, Cam-bridge, MA, 1998.
[19] L. Peshkin, N. Meuleau, and L.P. Kaelbling. Learn- ing policies with external memory. Proceedings ofthe Sixteenth International Conference on MachineLearning, To appear, 1999.
[20] M.L. Puterman. Markov Decision Processes: Dis- crete Stochastic Dynamic Programming. Wiley, NewYork, NY, 1994.
[21] J.K. Satia and R.E. Lave. Markov decision processes with probabilistic observation of states. ManagementScience, 20(1):1–13, 1973.
[22] R.D. Smallwood and E.J. Sondik. The optimal con- trol of partially observable Markov decision processesover a finite horizon. Operations Research, 21:1071–1098, 1973.
[23] E.J. Sondik. The optimal control of partially observ- able Markov decision processes over the infinite hori-zon: Discounted costs.
[24] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[25] C. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
[26] M. Wiering and J. Schmidhuber. HQ-Learning. Adap- tive Behavior, 6(2):219–246, 1997.
[27] R.J. Williams. Towards a theory of reinforcement- port NU-CCS-88-3, Northeastern University, Boston,MA, 1988.


DIÁRIO DA REPÚBLICA — I SÉRIE-A N.o 178 — 15 de Setembro de 2005 PRESIDÊNCIA DA REPÚBLICA 2005, o Decreto do Presidente da República n.o 42/2005,de 2 de Agosto, rectifica-se que onde se lê «ministroplenipotenciário de 1.a classe Joaquim José Ferreira da Declaração de Rectificação n.o 67/2005 Fonseca Embaixador de Portugal no Panamá» devePor ter sido publicado com ine

Microsoft word - lebri43.doc

Leserbriefinfos Nr. 43 1. Renaissance der Reaktoren . Die Kernkraft kommt zurück. Das ist der Tenor der Energiedebatte in den USA. Ab 2010 müssen in den USA 500.000 MW Kraftwerkskapazität ersetzt werden. Seit Mitte der 70er Jahre ist in den USA kein neues Kernkraftwerk (KKW) mehr bestellt worden. Das war auch nicht nötig, denn die Verfügbarkeit der KKW wurde durch die Betriebser

Copyright 2014 Pdf Medic Finder