Solving POMDPs by Searching the Space of Finite Policies Nicolas Meuleau, Kee-Eung Kim, Leslie Pack Kaelbling and Anthony R. Cassandra
Computer Science Dept, Box 1910, Brown University, Providence, RI 02912-1210
Abstract
storeable in a finite machine). Knowing that any policyis representable as a (possibly infinite) state automaton,the first constraint we would want to impose on the pol-
Solving partially observable Markov decision
icy is to be representable by a finite state automaton, or,
processes (POMDPs) is highly intractable in gen-
as we will call it, a finite “policy graph”. Many previous
eral, at least in part because the optimal policy
approaches implicitly rely on a similar hypothesis: Some
may be infinitely large. In this paper, we ex-
authors [14, 12, 2, 27] search for optimal reactive (or mem-
plore the problem of finding the optimal policy
oryless) policies, McCallum [15, 16] searches the space
from a restricted set of policies, represented as
of policies using a finite-horizon memory, Wiering and
finite state automata of a given size. This prob-
Schmidhuber’s HQL [26] learns finite sequences of reactive
lem is also intractable, but we show that the com-
policies, and Peshkin et al. [19] look for optimal finite-
plexity can be greatly reduced when the POMDP
external-memory policies. All these examples are particu-
and/or policy are further constrained. We demon-
lar cases of finite policy graphs, with a set of extra structural
strate good empirical results with a branch-and-
constraints in each case (i.e., not every node-transition and
bound method for finding globally optimal deter-
action choice is possible in the graph). Note that in general,
ministic policies, and a gradient-ascent method
finite policy graphs can remember events arbitrarily far in
for finding locally optimal stochastic policies.
the past. They are just limited in the number of events theycan memorize. INTRODUCTION
In this paper we study the problem of finding the best policyrepresentable as a finite policy graph of a given size, pos-sibly with simple constraints on the structure of the graph.
In many application domains, the partially observable
The idea of searching explicitly for an optimal finite pol-
Markov decision process (POMDP) [1, 22, 23, 5, 9, 4, 13] is
icy graph for a given POMDP is not new. In the early 70s,
a much more realistic model than its completely observable
Satia and Lave [21] proposed a heuristic approach for find-
counterpart, the classic MDP [11, 20]. However, the com-
ing -optimal decision trees. Hansen [6, 7, 8] proposed a
plexity resulting from the lack of observability limits the
policy iteration algorithm that outputs an -optimal con-
application of POMDPs to dramatically small decision prob-
troller. These solution techniques work explicitly in the
lems. One of the difficulties of the optimal—Bayesian—
belief space used in the classical—Bayesian—optimal so-
solution technique is that the policy it produces may use
lution of POMDPs, and they output policy-graphs which are
the complete previous history of the system to determine
from this optimal solution. Another ap-
the next action to perform. Therefore, the optimal policy
proach uses EM to find controllers that are optimal over a
may be infinite and we have to approximate it at some level
to be able to implement it in a finite machine. Anotherproblem is that the calculation requires reformulating the
A characteristic property of our algorithms is that they scale
problem in the continuous space of belief functions, and
up well with respect to the size of the problem. Their draw-
hence it is much harder than the simple finite computation
back is that their execution time increases quickly with the
that is sufficient to optimize completely observable MDPs.
size of the policy graph, i.e., with the complexity of thepolicy we are looking for. In general, they will be adapted
What can we do if we have to solve a huge POMDP? Since
to large POMDPs where relatively simple policies perform
it may be impossible just to represent the optimal pol-
near optimally. Another characteristic of our approach is
icy in memory, it makes sense to restrict our search to
that we do not refer to the value of the optimal Bayesian
policies that are reasonable in some way (calculable and
solution anymore, we just want the best graph given the
empirical evidence that our approach allows the solution of
constraint imposed on the number of nodes. Note that the
some POMDPs whose size is far beyond the limits of clas-
optimality criterion used is the same as in the Bayesian ap-
proach, i.e., the expected discounted cumulative rewards(the expectation being relative to the prior belief on the
POMDPS AND FINITE POLICY
states). However, since we do not evaluate the solution
produced relative to the optimal performance, the problemmay be solved without using the notion of belief-space.
Our development relies on a basic property of finite-statecontrollers that has already been stressed by Hansen [6],
A partially observable Markov decision process (POMDP)
and that is also very close to Parr and Russell’s HAM theo-
✂✁☎✄✝✆✞✄✠✟✡✄✠☛☞✄✍✌✎✄✝✏✒✑
rem [18]. Namely, given a POMDP and a finite policy graph,the sequence of pairs (state of the POMDP, node of the pol-
icy graph) constitutes a Markov chain. Going farther, we
define a new MDP on the cross product of the state-space
and the set of nodes, where a decision is the choice of an
action and of a next node (conditioned on the last obser-
vation). Working in this cross-product MDP presents many
☛✔ ✂✕✖✄✍✗✘✑✚✙✜✛✣✢✖ ✤✗✖✥✣✙✦✗✔✧✖✕★✥✩✙✜✕✪✑
advantages: it allows us to calculate and then differentiate
the value of a fixed policy, to calculate upper and lower
✌✬ ✂✕✖✄✍✭✮✄✝✕★✯✤✑✩✙✜✛✣✢✖ ✰✕★✥✰✱☎✲✳✙✜✕✴✯✚✧✖✕★✥✩✙✜✕✖✄✍✭✵✥✩✙✶✭✷✑
bounds on the value attainable by completing a given par-tial policy, and also to establish some complexity results.
✥✳✙✺✏✻ ✰✕✼✄✠✭✮✄✴✕✴✯✤✑
We use these properties to develop implementations of two
classic search procedures (one global and one local) wherethe majority of the computation consists of solving some
Bellman equations in the cross-product MDP. An important
✰✁☎✄✍✟✬✄✍✌✎✄✝✏✒✑
is optimized is the following way [11, 20]:
point is that the structure of both the POMDP and the pol-
icy graph (if there is one) is reflected in the cross-product
MDP. It can be used to accelerate the solution of Bellmanequations [3], and hence the execution of the solution tech-
niques. In other words, the algorithms we propose can ex-
ploit the structure of the POMDP to find relatively quickly
the best general finite policy graph of a given size. If this
leverage is not sufficient, we may limit further the search
space by imposing some structure on the policy graph, and
to perform in each possible state. The optimal expected
then using this structure to speed up the solution of the
discounted reward, or “value function”, is defined as the
cross-product MDP (for instance, we can limit ourselves to
unique solution of the set of Bellman equations:
one of the finite-memory architectures mentioned above).
The paper is organized as follows. First we give a quick
✑❜ ❝✏✻ ✰✕✼✄✠✭✮✄✴✕
introduction to POMDPs and policy graphs, and define the
deterministic finite policy graph is an NP-hard problem.
for all . It is a remarkable property of MDPs that there
There is then no really easy way to solve our problem. In
exists an optimal policy that always executes the same ac-
this paper, we propose two possible approaches: a global
tion in the same state. Unfortunately, this policy cannot be
branch and bound search for finding the best determinis-
used in the partially observable framework, because of the
tic policy graph, and a local gradient descent search for
residual uncertainty on the current state of the process.
finding the best stochastic policy graph. These two algo-rithms are based on solving some Bellman equations in the
In the POMDP framework, a policy is in general a rule spec-
cross-product MDP. Therefore, they can take full advan-
ifying the action to perform at each time step as a function
tage of any preexisting structure in the POMDP or in the
of the whole previous history, i.e., the complete sequence of
policy graph. Typically, these algorithms will be adapted
observation-action pairs since time 0. A particular kind of
to very structured POMDPs with a large number of states,
policy, the so-called reactive policies (RPs), condition the
a small number of observations, and such that some sim-
choice of the next action only on the previous observation.
ple policies perform well. In the end of the paper, we give
Thus, they can be represented as mappings
and the last observation. We will use the following nota-tion:
Figure 1: Structure of the policy graphs representing reac-
is the probability of moving from node ✆
(reactive or not) realizes an expected cumu-
is the probability distribution of the initial node ✆
The classical—Bayesian—approach allows us to determinethe policy that maximizes this value. It is based on updating
In some cases, we will want to impose extra constraints on
the state distribution (or belief) at each time step, depend-
the policy graph. In most of this paper, we will limit our-
ing on the most recent observations [5, 9, 4, 13]. The prob-
selves to “restriction constraints” which consist in restrict-
lem is re-formulated as a new MDP using belief-states in-
ing the set of possible actions executable in some nodes,
stead of the original states. Generally, the optimal solution
and/or restricting the set of possible successors of some
is not a reactive policy. It is a sophisticated behavior, with
nodes under some observations. Note that forcing the graph
optimal balance between exploration and exploitation. Un-
to implement an RP represents a set of restriction constraint
fortunately, the Bayesian calculation is highly intractable
as defined here. We consider more sophisticated sets of
as it searches into the continuous space of beliefs and con-
siders every possible sequence of observations. THE CROSS-PRODUCT MDP FINITE POLICY GRAPHS
One advantage of representing the policy as a policy graph
A policy graph for a given POMDP is a graph where the
is that the cross-product of the POMDP and the policy graph
is itself a finite MDP. Another interesting point is that all the
structure of both the POMDP and the policy graph (if there
arc emanating from each node for each possible observa-
is some) is represented in this cross-product MDP. It will
tion. When the system is in a certain node, it executes the
allow us to develop relatively fast implementations of some
action associated with this node. This implies a state transi-
classical techniques to solve our problem.
tion in the POMDP and eventually a new observation (whichdepends on the arrival state of the underlying MDP). This
Calculating the value of a policy graph
observation itself conditions a transition in the policy graph
theorem has been used by Hansen [6, 8, 7], and his closely
to the destination node of the arc associated with the new
related to Parr and Russell’s HAM theorem [18].
observation. Every policy has a representation as a possi-bly (countably) infinite policy graph. A policy that chooses
Theorem 1 Given a policy graph
a different action for each possible previous history will be
✂✁☎✄✝✆✞✄✠✟✡✄✠☛☞✄✍✌✎✄✝✏✒✑
represented by an infinite tree with a branch for each possi-
generated constitutes a Markov chain.
ble history. Reactive policies correspond to a special kindof finite policy graph with as many nodes as there are ob-
The influence diagram of figure 2 proves this property:
✥✂✱☎✲✪✄✴✕✴✥✂✱☎✲✠✑
the same observation go to the same node (figure 1).
In a stochastic policy graph there is a probability distribu-
tion over actions attached to each node instead of a single
✄✠✭✷✑❝✌✬ ✂✕✖✄✍✭✮✄✝✕
action, and transitions from one node to another are ran-
dom, the arrival node depending only on the starting node
where the maximization is applied row by row. When weexpand this equation, the maximization over
Figure 2: Influence diagram proving the Markov property
✑✭✬✪✏✻ ✰✕✼✄✠✭✮✄✴✕
of the cross-product MDP. Dotted arrows represent depen-
dencies that we did not take into account in this work, but
that are sometimes represented in other formulations. As
shown, theorem 1 is still valid in this more general frame-
The expected optimal reward, independent of the starting
In the same way, we can calculate the expected immediate
. Note that this optimal policy is generally not imple-
✑❴✏✻ ✂✕✖✄✍✭✮✄✝✕
mentable in the policy graph, since it may associate two
different actions with the same node, depending on the state
with which the node is coupled. In other words, we need to
know the current state to use this policy. The agent using
the fundamental equation (in matrix form):
a policy graph is basically embedded in the cross-product
MDP, but it has only partial observability of its product state
fundamental equation (7) is useful in some algorithms, be-
cause it represents an upper-bound of the performance at-
Finally, the value of the policy, independent of the starting
tainable by any implementable policy. We will use this in a
branch-and-bound algorithm for finding the optimal deter-ministic policy graph. Note that the addition of restriction
constraints on the policy, as defined in section 2.2, does not
invalidate theorem 2. It just limits the set of possible ac-
tions in some states of the cross-product MDP, and then it
reduces the complexity of its solution. As a consequence,
the branch-and-bound algorithm will also be able to find
the best graph under some restriction constraints.
Differentiating this value with respect to the parameters ofthe graph will enable us to climb its gradient. Computational leverage.
that most of the computation performed by the algorithms
Solving the cross-product MDP.
that will follow consists of solving a Bellman fundamen-
tal equation with a fixed policy as in (4), or for the sake of
an action , wait for the new observation, and then chose
finding the optimal deterministic policy as in (7). This can
the next node. It is equivalent to choosing an action and
be done by successive approximations, the algorithm being
called “value iteration” in the case of (7). The complexity
as a function of the next observation. Therefore, the action
(times the number of iterations, which can be
Theorem 2 The tuple
important point is that any structure in the transition matrix
can be exploited while executing these back-ups. The
✄✝✕❊✑❴✑✣✙✶✌✬ ✰✕✼✄✠✭✮✄✴✕
the structure of the POMDP: A sparse transition matrix
of the POMDP provides leverage that allows the
speed-up of successive-approximation iterations [3].
✯✂✄✴✕✴✯❵✑❴✑✚✙✜✏✻ ✰✕✼✄✠✭✮✄✴✕✴✯❵✑
is the branching factor of the POMDP (i.e.,
the average number of possible successors of a state)
It is a branch-and-bound algorithm; i.e., it systematically
enumerates the set of all possible solutions using bounds
For instance, deterministic transitions reduce the com-
on the quality of partial solutions to exclude entire regions
of the search space. If the lower bound of one partial policy
observation matrix is exploitable. For instance, deter-
is greater than the upper bounds of others, then it is useless
ministic observations reduce the complexity by a fac-
explore these partial policies. Otherwise, each possible ex-
tension of them will considered in time. Therefore, the al-gorithm is guaranteed to find the optimal solution in finite
the structure of the policy graph: If the leverage gained
time. Note that this approach is a generalization of a pre-
from the structure of the POMDP is not sufficient,
vious algorithm proposed by Littman [14]. His algorithm
then one can choose to restrict further the search
is limited to policy graphs representing RPs and to POMDPs
space by imposing structural constraints on the graph,
with a very particular structure: state-transitions and obser-
and using this structure to speed up the calculation.
vations are deterministic, and the problem is an achieve-
An extreme, but often adopted, solution is to look
ment task (i.e., there is a given goal state that must be
reached as soon as possible). The formalism proposed here
handles any kind of POMDP and any kind of policy graph
with restriction constraints. However, Littman gives more
more about constraining the policy graph in section 4.
details on some aspects of the algorithm, and the reader canrefer to his paper to complete the brief description that we
When both the problem and the policy are structured, the
leverage gained can be bigger than just the addition of theeffect of the two structures. For instance, evaluating a
Ordering of free parameters.
RP in a completely deterministic problem can be done in
policies is expanded (in depth-first, breadth-first, or in a
best-first way) by picking the free parameters of the pol-icy one after the other, and considering all possible assign-
FINDING THE OPTIMAL POLICY
ment values for each of them. The game of pruning some
branches based on upper bound/lower bounds comparisonis added onto that. The size of the tree that is actually ex-
In this paper, we consider the problem of finding the best
panded in this process strongly depends on the order in
policy graph of a given size for a POMDP. Littman [14]
which the free parameters are picked. In our case, it is
showed that finding the best RP for a given POMDP is an
NP-hard problem. First, we generalize this result to any
values of ✔ . In other words, when building a solution, we
finite policy graph with a given number of nodes and any
assign actions to the nodes first, and then we fix the struc-
ture of the graph. Otherwise, no pruning is possible beforeall possible structures have been expanded (this is due to
Theorem 3 Given a POMDP and a set of restriction con-
the nature of the upper-bound that we use, see below). In
straints, the problem of finding the optimal deterministicpolicy graph satisfying the constraints is NP-hard.
but before ✔ . There is also an issue with the symmetry ofthe policy-graph space. For instance, in the absence of re-
The proof is straightforward: Finding the best deterministic
striction constraints, we can permute the role played by the
policy graph is equivalent to finding the best deterministic
different nodes without changing the policy. Each policy
RP of the cross-product POMDP. Then the result follows
avoid enumerating equivalent graphs by imposing some ar-bitrary rule when expanding the tree. For instance we can
The techniques for solving NP-hard problems may be clas-
impose that the index of the action attached to a node al-
sified into three groups: global search, local search and
ways be greater or equal to the index of the action of node
approximation algorithms. In this paper, we will use two
, for all . This simple trick can improve greatly the
classic techniques, a global search (section 3.1) and a local
performance of the heuristic search, merely dividing the
search (section 3.2). We will consider a possible approxi-
GLOBAL SEARCH Upper bounds.
A partial solution is a general finite pol-
icy graph with more restriction constraints than initially
A heuristically guided search is used to find the best deter-
(each time we specify an action or a node-transition, we
ministic policy graph of a given size, whatever the restric-
add a constraint). Then we can get an upper bound by solv-
ing the cross-product MDP, as explained in the second part
LOCAL SEARCH
specified policy graph corresponds to an RP of the cross-product (PO)MDP, so no policy graph can do better than
In this approach, we try to find the best stochastic policy
the optimal solution of the cross-product MDP. Note that
graph by treating this problem as a classical non-linear nu-
this upper bound has a nice monotonicity property: it does
merical optimization problem. Since the value of a policy
not increase when we fix a free parameter, and it is equal to
graph is continuous and differentiable with respect to the
the true value of the policy graph when the graph is com-
policy parameters, we can calculate its gradient and climb
pletely specified. On the other hand, as long as no value
it in many different ways. We will not develop all the pos-
sibilities for climbing the gradient here, but we will rather
is specified, the optimal policy found by solving the
cross-product MDP is equivalent to the optimal policy of the
focus on the calculation of the gradient, and then just de-
✰✁☎✄✍✟✬✄✍✌✎✄✝✏✒✑
pict a simple implementation of gradient ascent. Note that
since the gradient may be calculated exactly, this approach
is independent of ✆ and depends only on . Hence,
the calculated upper bound is always equal to the value of
is guaranteed to converge to a local optimum. The topology
the optimal policy of the cross-product MDP. This is why
of the search space, and hence the number of local optima,
no pruning can be done as long as no value of
depends on two things: the structure of the
specified and this parameter must be the considered first
and the constraints imposed on the policy. By introducing
constraints on the policy, we can hope not only to reducethe execution time of the algorithm, but also to change the“landscape” for a less multimodal one. Lower bounds.
order, then we can use the values of the complete poli-
Calculating the gradient.
cies already expanded to determine lower bounds on the
is given by equation (6). For each policy parameter
best performance attainable by extending each partial pol-
icy. Otherwise, we can find a lower bound for a given par-
tial policy by completing it at random and calculating thevalue of the resulting complete policy. An improvement
consists of performing a simple local optimization after
having completed the policy [14]. In our implementation,
we also used a heuristic technique based on the solution of
We are interested in the gradient with respect to the pol-
the cross-product MDP to complete the partial policy. We
icy parameters, i.e., we will consider
calculate the performance of a complete policy by solving
equation (4) in the cross-product MDP.
respect to these variables can be calculated easily startingfrom (2) and (3). The main difficulty in the calculation of
Complexity.
bounds of each node of the expanded tree requires solving
some Bellman equations in the cross-product MDP. Hence
approximation, the basic update rule being:
ber of iterations of successive approximation executed dur-
ing this calculation, one can store, with each partial policy,
, the complexity of a complete back up is
the value function found when calculating its upper bound.
Then we can start the computation of the upper bound of
inverse can be used to calculate the gradient with respect to
a child partial policy starting from the value of its parent.
any parameter . A minor acceleration can be obtained by
Since they are often not very different, we can gain a lot of
time with this trick. However, the memory space needed in-
the iterative computation of this value at the new point. It
creases dramatically. Even if we can calculate the bounds
can reduce the number of iterations of (8) at each step, but
relatively quickly, the real problem is how many nodes it
it is still a matrix-wise DP with a complexity in
will be necessary to expand before reaching the optimum.
In the worst case, the complete tree of all possible solu-tions will be expanded, which represents a complexity ex-
There is another way of computing the gradient with a com-
ponential in the number of degrees of freedom of the policy
graph. In practice, our simulations showed that many fewer
nodes are actually expanded. Note that adding simple con-
several (classical) vector-wise DPs for which complexity
straints on the policy reduces not only the complexity of
, or less if there is some structure in ✄
the search space and hence the number of nodes expanded.
Figure 3: The load/unload problem with 8 locations: the
agent starts in the “Unload” location (U) and receives a re-
ward each time it returns to this place after passing through
the “Load” location (L). The problem is partially observ-
able because the agent cannot distinguish the different lo-
which is also a vector-wise, square-complexity DP. The
cations in between Load and Unload, and because it can-
tion must be re-done for each policy parameter
and ❙ ✳ depend on . Thus, we have divided the complex-
This Monte-Carlo approach works only if there are strong
structural constraints on the graph, and thus cannot be ap-
free variables of the graph. However, this approach will be
plied for finding general finite policy graphs. Note also that
useful in most cases, since there are often many fewer free
Littman reported observing a great superiority (in terms of
variables in the policy graph than “cross-product states”.
execution time) of the global branch-and-bound search over
For instance, if we are looking for the best reactive policy,
the Monte-Carlo approach, in the case where the graph is
then the indirect calculation allows us to gain a factor of
constrained to encode a simple RP. Our simulations with
other architectures (sets of structural constraints) showedsimilar results: in general, the Monte-Carlo approach can-
Climbing the gradient.
the problem is somewhat more complicated since all the
INTRODUCING STRUCTURAL
parameters that we optimize are probabilities and we have
CONSTRAINTS
to ensure that they stay valid (i.e., inside of the simplex)after each update. There are numerous ways for doing that,
Because the majority of their computation is to perform
including renormalizing and using the soft-max function.
Bellman back-ups in the cross-product MDP, the algorithms
In our implementation, we chose to project the calculated
outlined above can take advantage of any preexisting struc-
gradient on the simplex, and then apply it until we reach an
ture in the POMDP. However, this leverage can be insuffi-
edge of the simplex. If we reach an edge and the gradient
cient if the problem is too big or too difficult for the two
points outside of the simplex, then we project the gradient
techniques. In this case, one may whish to restrict further
the search space by imposing structural constraints on thepolicy graph. For instance, a simple solution consists of
Related work.
The idea of using a gradient algorithm for
defining a neighborhood for the nodes of the graph, and
solving POMDPs has already been pursued by several au-
allowing transitions only to a neighboring node. This cor-
thors [2, 12, 27]. The main difference between this work
responds to a set of restriction constraints (✔ is forced to
and ours is that these authors use a Monte-Carlo estima-
take the value zero in many points), and hence the algo-
tion of the gradient instead of an exact calculation, and that
rithm above can still be applied. A somehow extreme so-
they limit themselves to RPs, which is much less general
lution consists of limiting the search to reactive policies
than our approach. Moreover, Jaakkola et al. do not use the
(then ✔ is completely fixed in advance). More complex sets
exponentially discounted criterion (1), but the average re-
of constraints can also be used, for instance, we can limit
ward per time step. In a companion paper [17], we propose
the search to policy representable as a finite sequence of
a stochastic gradient descent approach for learning finite
RPs (with particular rules governing the transition from one
policy graph during a trial-based interaction with the pro-
RP to another), as in Wiering and Schmidhuber’s HQL [26].
Other instances include the finite-horizon memory policiesused by McCallum [15, 16], or the external-memory poli-
OTHER APPROACHES
cies used by Peshkin et al. [19]. Although these architec-tures cannot be described only in terms of restriction con-
A Monte-Carlo approach based on Watkins’Q-learning
straints (there are also equality constraints between differ-
[25, 24] is also applicable to our problem. For instance, we
ent parameters of the graph), the previous results and al-
can an use Q-learning based on observation-action pairs to
gorithm can be extended to each of them in particular. In
find (with no guarantee of convergence) the optimal RP for
other words, we can use the previous algorithm to find
a POMDP [14]. Another instance is Wiering and Schmid-huber’s HQL [26], which learns finite sequences of RPs.
Figure 5: A partially observable stochastic maze: the agent
must go from the starting state marked with an “S”to the
goal marked with an “G”. The problem is partially observ-
able because the agent cannot perceive its true location, butonly its orientation and the presence or the absence of a
Figure 4: Simulations results obtained with the load/unload
wall on each side of the square defining its current state.
problem: execution time of the algorithms as a function of
The problem is stochastic because there is a non-zero prob-
the size of the problem (ga: gradient ascent, dfh: depth first
ability of slipping, so that the agent does not always know
heuristic search, bfh: breadth first heuristic search).
if its last attempt to make a move had any consequence onits actual position in the maze.
the best policy using a given finite-horizon memory,
the best policy using an external-memory of a given
easy POMDP, the results obtained represent a kind of upper
bound on the performance of the algorithms. It is unlikelythat they will perform better on another (harder) problem.
(we can also show that it is NP-hard to solve these prob-
2 and the gradient algorithm always started from the centerof the simplex (i.e., the policy graph is initialized with uni-
What do we gain and what do we lose when we impose
form distributions).1 We measured the time of execution of
structural constraints reduces the number of parameters per
ascent, we stopped when we reached 99% of the optimal.
node (which should help both techniques), and modifies the
When the heuristic search uses a stochastic calculation of
topology of the search space (which influences the gradi-
upper bounds, we average the measure over 50 runs.
ent descent approach). Another point is that the best graph
set to 0.996 (a big value is necessary to accommodate big
without the constraints can be better than the best graph
state-spaces), and the learning rate of gradient descent was
with the constraints, i.e., the constraints can decrease the
optimized. The results are given in figure 4. They show
value of the best solution. Even if this does not happen,
that the heuristic search clearly outperforms the gradient
more nodes may be required to reach the best performance
algorithm, which becomes numerically unstable when the
with the constraints than without. Consider, for instance,
number of states increases in this kind of geometrically-
the load/unload problem represented in figure 3. This sim-
ple problem is solved optimally with a two-node policygraph, or with a sequence of two RPs as used in HQL. As an
In the second set of experiments, we wanted to measure
how far our algorithms can go in terms of problem size,
in a problem more difficult than the simple load/unload.
the number of parameters per node will be smaller than in
We used the a set of partially observable mazes with the
the unconstrained case. In general, adding structure will be
regular structure represented in figure 5, and whose size
interesting if we choose an architecture that fits the prob-
lem at hand. Hence, it is a question of previous knowledge
mazes are not particularly easy, since they have only two
about the problem at hand and its optimal solution.
different optimal paths. The minimal number of nodes forsolving them is 4, one per action (although the policy is not
SIMULATION RESULTS
reactive). The time required for the (depth first) branch-and-bound algorithm to find the optimal solution with thisoptimal number of nodes is shown in figure 6. We see that
In our first experiments, we used the simple load/unloadproblem of figure 3 with an increasing number of loca-
1We used a simpler version of the algorithms where the start-
tions, to see how both algorithms scale up to increasing
ing node is fixed. Otherwise, the policy using only uniform dis-
problem-size, and how they compare. Since it is a very
tributions is a (very unstable) local optimum.
tional leverage. Journal of AI Research, To appear,
[4] A.R. Cassandra. Exact and Approximate Algorithmsfor Partially Observable Markov Decision Processes.
[5] A.R. Cassandra, L.P. Kaelbling, and M.L. Littman.
Acting optimally in partially observable stochastic
domains. In Proceedings of the Twelfth National Con-ference on Artificial Intelligence, 1994.
[6] E.A. Hansen. An improved policy iteration algorithm
for partially observable MDPs. In Advances in Neu-ral Information Processing Systems, 10. MIT Press,
Figure 6: Performance of the branch-and-bound algorithm
on the maze problem: execution time as a function of thenumber of states. Finite-Memory Control of Partially
Computer Science, University of Massachusetts at
we can solve a partially observable maze with almost 1000
states in a less than 6 hours. It represents a performancefar above the capacities of classic approaches for solving
[8] E.A. Hansen. Solving POMDPs by searching in pol-
POMDPs. Note also that, as the number of states grows,
icy space. In Proceedings of the Eighth Conference on
the measured complexity is almost linear in the number of
Uncertainty in Artificial Intelligence, pages 211–219,
CONCLUSION
[9] M. Hauskrecht. Planning and Control in Stochas-tic Domains with Imperfect Information. PhD thesis,MIT, Cambridge, MA, 1997.
We studied the problem of finding the optimal policy rep-resentable as a finite state automaton of a given size, pos-
[10] O. Higelin. Optimal Control of Complex Structured
sibly with some simple structural constraints. This ap-
Processes. PhD thesis, University of Caen, France,
proach by-passes the continuous and intractable belief-state
space. However, we showed that we end up with a NP-hard problem anyway. Then we proposed to use two classic
[11] R.A. Howard. Dynamic Programming and Markov
search techniques, and developed efficient implementations
Processes. MIT Press, Cambridge, 1960.
of them that allow using the structure of the problem to ac-celerate the computation. If this is not sufficient, bigger
[12] T. Jaakkola, S. Singh, and M.R. Jordan.
leverage can be gained by imposing structure on the policy.
forcement learning algorithm for partially observable
However, our algorithms are limited by necessity to enu-
Markov problems. In Advances in Neural Informa-
merate at least once per iteration, the complete state space
tion Processing Systems, 7. MIT Press, Cambridge,
of the POMDP. In a companion paper [17], we propose an
indirect learning algorithm that avoids this bottleneck.
[13] L.P. Kaelbling, M.L. Littman, and A.R. Cassandra.
Planning and acting in partially observable stochastic
References
domains. Artificial Intelligence, 101, 1998.
[1] K.J. Astrom. Optimal control of Markov decision pro-
[14] M.L. Littman. Memoryless policies: Theoretical lim-
cesses with incomplete state estimation. J. Math. Anl.Animats 3: Proceedings of the Third International
[2] L.C. Baird and A.W. Moore. Gradient descent for
Conference on Simulation of Adaptive Behavior. MIT
general reinforcement learning. In Advances in Neu-ral Information Processing Systems, 12. MIT Press,Cambridge, MA, 1999.
[15] R.A. McCallum. Overcoming incomplete perception
with utile distinction memory. In The Proceedings
[3] C. Boutillier, T.L. Dean, and S. Hanks. Decision the-
of the Tenth International Machine Learning Confer-
oretic planning: structural assumptions and computa-
[16] R.A. McCallum. Reinforcement Learning with Selec-tive Perception and Hidden State. PhD thesis, Univer-sity of Rochester, Rochester, NY, 1995.
[17] N. Meuleau, L. Peshkin, K.E. Kim, and L.P. Kael-
bling. Learning finite-state controllers for partiallyobservable environments. teenth Conference on Uncertainty in Artificial Intel-ligence, To appear, 1999.
[18] R. Parr and S. Russell. Reinforcement learning with
hierarchies of machines. In Advances in Neural In-formation Processing Systems 11. MIT Press, Cam-bridge, MA, 1998.
[19] L. Peshkin, N. Meuleau, and L.P. Kaelbling. Learn-
ing policies with external memory. Proceedings ofthe Sixteenth International Conference on MachineLearning, To appear, 1999.
[20] M.L. Puterman. Markov Decision Processes: Dis-crete Stochastic Dynamic Programming. Wiley, NewYork, NY, 1994.
[21] J.K. Satia and R.E. Lave. Markov decision processes
with probabilistic observation of states. ManagementScience, 20(1):1–13, 1973.
[22] R.D. Smallwood and E.J. Sondik. The optimal con-
trol of partially observable Markov decision processesover a finite horizon. Operations Research, 21:1071–1098, 1973.
[23] E.J. Sondik. The optimal control of partially observ-
able Markov decision processes over the infinite hori-zon: Discounted costs.
[24] R.S. Sutton and A.G. Barto. Reinforcement Learning:An Introduction. MIT Press, Cambridge, MA, 1998.
[25] C. Watkins. Learning from Delayed Rewards. PhD
thesis, King’s College, Cambridge, 1989.
[26] M. Wiering and J. Schmidhuber. HQ-Learning. Adap-tive Behavior, 6(2):219–246, 1997.
[27] R.J. Williams. Towards a theory of reinforcement-
port NU-CCS-88-3, Northeastern University, Boston,MA, 1988.

DIÁRIO DA REPÚBLICA — I SÉRIE-A N.o 178 — 15 de Setembro de 2005 PRESIDÊNCIA DA REPÚBLICA 2005, o Decreto do Presidente da República n.o 42/2005,de 2 de Agosto, rectifica-se que onde se lê «ministroplenipotenciário de 1.a classe Joaquim José Ferreira da Declaração de Rectificação n.o 67/2005 Fonseca Embaixador de Portugal no Panamá» devePor ter sido publicado com ine

Leserbriefinfos Nr. 43 1. Renaissance der Reaktoren . Die Kernkraft kommt zurück. Das ist der Tenor der Energiedebatte in den USA. Ab 2010 müssen in den USA 500.000 MW Kraftwerkskapazität ersetzt werden. Seit Mitte der 70er Jahre ist in den USA kein neues Kernkraftwerk (KKW) mehr bestellt worden. Das war auch nicht nötig, denn die Verfügbarkeit der KKW wurde durch die Betriebser