## Section: Research Program

### Sequential Decision Making

#### Synopsis and Research Activities

Sequential decision making consists, in a nutshell, of controlling the actions of an agent facing a problem whose solution requires not one but a whole sequence of decisions. Such problems occur in many forms. Important applications addressed in our work include: Robotics, where the agent is a physical entity moving in the real world; Medicine, where the agent can be an analytic device recommending tests and/or treatments; Computer Security, where the agent can be a virtual attacker trying to identify security holes in a given network; and Business Process Management, where the agent can provide an auto-completion facility that helps decide which steps to include in a new or revised process. Our work on such problems is characterized by three main lines of research:

• (A) Understanding how, and to what extent, to best model the problems.

• (B) Developing algorithms solving the problems and understanding their behavior.

• (C) Applying our results to complex applications.

Before we describe some details of our work, it is instructive to understand the basic forms of problems we are addressing. We characterize problems along the following main dimensions:

• (1) Extent of the model: full vs. partial vs. none. This dimension concerns how complete we require the model of the problem – if any – to be. If the model is incomplete, then learning techniques are needed along with the decision making process.

• (2) Form of the model: factored vs. enumerative. Enumerative models explicitly list all possible world states and the associated actions etc. Factored models can be exponentially more compact, describing states and actions in terms of their behavior with respect to a set of higher-level variables.

• (3) World dynamics: deterministic vs. stochastic. This concerns our initial knowledge of the world the agent is acting in, as well as the dynamics of actions: is the outcome known a priori or are several outcomes possible?

• (4) Observability: full vs. partial. This concerns our ability to observe what our actions actually do to the world, i.e., to observe properties of the new world state. Obviously, this is an issue only if the world dynamics are stochastic.

These dimensions are widespread in the AI literature but are not exhaustive. In particular, the MAIA team is also interested in discrete vs. continuous and centralized vs. decentralized problems. The complexity of solving a problem, both in theory and in practice, depends heavily on where it resides in this categorization. A common practice is to address simplified problems, which may yield sub-optimal solutions, while trying to characterize how far these solutions are from the optimum.

In what follows, we outline the main formal frameworks on which our work is based; while doing so, we highlight in a little more detail our core research questions. We then give a brief summary of how our work fits into the global research context.

#### Formal Frameworks

##### Deterministic Sequential Decision Making

Sequential decision making with deterministic world dynamics is most commonly known as planning, or classical planning [49]. Obviously, in such a setting every world state needs to be considered at most once, and thus enumerative models do not make sense (the problem description would have the same size as the space of possibilities to be explored). Planning approaches support factored description languages in which complex problems can be modeled in a compact way. Approaches to automatically learn such factored models do exist; however, most work, including most of our own work on this form of sequential decision making, assumes that the model is provided by the user of the planning technology. Formally, a problem instance, commonly referred to as a planning task, is a four-tuple $\langle V,A,I,G\rangle$. Here, $V$ is a set of variables; a value assignment to the variables is a world state. $A$ is a set of actions, each described in terms of two formulas over $V$: its precondition and its effect. $I$ is the initial state, and $G$ is a goal condition (again a formula over $V$). A solution, commonly referred to as a plan, is a schedule of actions that is applicable to $I$ and achieves $G$.
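To make the four-tuple concrete, here is a minimal Python sketch of a planning task and a blind breadth-first search over its state space. It rests on simplifying assumptions that are not part of the definition above: preconditions, effects and the goal are restricted to conjunctions of `variable = value` assignments, and the robot/parcel task, the action names and the helper functions are hypothetical illustrations rather than any particular planner's input language.

```python
from collections import deque

# Hypothetical toy task: a robot must carry a parcel from room A to room B.
# V = {robot, parcel}. Each action is (name, precondition, effect), where both
# formulas are simplified to partial assignments (an assumption of this sketch).
ACTIONS = [
    ("move_A_B", {"robot": "A"}, {"robot": "B"}),
    ("move_B_A", {"robot": "B"}, {"robot": "A"}),
    ("pick_A", {"robot": "A", "parcel": "A"}, {"parcel": "robot"}),
    ("drop_B", {"robot": "B", "parcel": "robot"}, {"parcel": "B"}),
]
INIT = {"robot": "B", "parcel": "A"}  # I: a complete assignment to V
GOAL = {"parcel": "B"}                # G: a partial assignment to V


def holds(condition, state):
    """True if every variable constrained by `condition` has that value in `state`."""
    return all(state[var] == val for var, val in condition.items())


def bfs_plan(init, actions, goal):
    """Return a shortest sequence of action names from `init` to a goal state, or None."""
    def key(state):
        return tuple(sorted(state.items()))  # hashable form of a state

    frontier = deque([(init, [])])
    seen = {key(init)}  # each world state is considered at most once
    while frontier:
        state, plan = frontier.popleft()
        if holds(goal, state):
            return plan
        for name, pre, eff in actions:
            if holds(pre, state):
                successor = {**state, **eff}  # the effect overwrites the touched variables
                if key(successor) not in seen:
                    seen.add(key(successor))
                    frontier.append((successor, plan + [name]))
    return None


plan = bfs_plan(INIT, ACTIONS, GOAL)
print(plan)
```

On this toy task the search returns `['move_B_A', 'pick_A', 'move_A_B', 'drop_B']`. Blind search of this kind enumerates the state space and is only practical for tiny tasks.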

Planning is PSPACE-complete even under strong restrictions on the formulas allowed in the planning task description. Research thus revolves around the development and understanding of search methods, which explore, in a variety of different ways, the space of possible action schedules. A particularly successful approach is heuristic search, where search is guided by information obtained in an automatically designed relaxation (simplified version) of the task. We investigate the design of relaxations, the connections between such design and the search space topology, and the construction of effective planning systems that exhibit good practical performance across a wide range of different inputs. Other important research lines concern the application of ideas successful in planning to stochastic sequential decision making (see next), and the development of technology supporting the user in model design.
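As an illustration of search guided by a relaxation, the sketch below computes a heuristic in a relaxed task where a variable keeps every value it has ever taken, and uses it in greedy best-first search. This is the textbook $h^{\max}$ heuristic, chosen here only as a familiar example and not necessarily the relaxation studied in our work. As a further assumption, preconditions, effects and goals are plain `variable = value` conjunctions, and the toy robot/parcel task and all names are hypothetical.

```python
import heapq
import math

# Hypothetical toy task; each action is (name, precondition, effect).
ACTIONS = [
    ("move_A_B", {"robot": "A"}, {"robot": "B"}),
    ("move_B_A", {"robot": "B"}, {"robot": "A"}),
    ("pick_A", {"robot": "A", "parcel": "A"}, {"parcel": "robot"}),
    ("drop_B", {"robot": "B", "parcel": "robot"}, {"parcel": "B"}),
]
INIT = {"robot": "B", "parcel": "A"}
GOAL = {"parcel": "B"}


def h_max(state, actions, goal):
    """Estimate the cost to the goal in the relaxation where achieved facts stay true."""
    cost = {fact: 0 for fact in state.items()}  # facts true in `state` cost nothing
    changed = True
    while changed:  # fixpoint: fact costs only ever decrease
        changed = False
        for _, pre, eff in actions:
            pre_cost = max((cost.get(f, math.inf) for f in pre.items()), default=0)
            if pre_cost == math.inf:
                continue  # precondition not (yet) reachable in the relaxation
            for fact in eff.items():
                if pre_cost + 1 < cost.get(fact, math.inf):
                    cost[fact] = pre_cost + 1
                    changed = True
    return max((cost.get(f, math.inf) for f in goal.items()), default=0)


def greedy_best_first(init, actions, goal):
    """Always expand the open state with the smallest heuristic value."""
    def key(state):
        return tuple(sorted(state.items()))

    tie = 0  # unique tie-breaker so the heap never compares states
    frontier = [(h_max(init, actions, goal), tie, init, [])]
    seen = {key(init)}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if all(state[v] == val for v, val in goal.items()):
            return plan
        for name, pre, eff in actions:
            if all(state[v] == val for v, val in pre.items()):
                successor = {**state, **eff}
                if key(successor) in seen:
                    continue
                seen.add(key(successor))
                h = h_max(successor, actions, goal)
                if h < math.inf:  # infinite estimate: goal unreachable, prune
                    tie += 1
                    heapq.heappush(frontier, (h, tie, successor, plan + [name]))
    return None


h0 = h_max(INIT, ACTIONS, GOAL)
plan = greedy_best_first(INIT, ACTIONS, GOAL)
```

Here `h0` evaluates to 3, while the shortest plan has 4 actions: the relaxation underestimates because the robot is treated as being in both rooms once it has visited them.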

##### Stochastic Sequential Decision Making

Markov Decision Processes (MDPs) [51] are a natural framework for stochastic sequential decision making. An MDP is a four-tuple $\langle S,A,T,r\rangle$, where $S$ is a set of states, $A$ is a set of actions, $T(s,a,s')=P(s'\mid s,a)$ is the probability of transitioning to $s'$ given that action $a$ was chosen in state $s$, and $r(s,a,s')$ is the (possibly stochastic) reward obtained from taking action $a$ in state $s$ and transitioning to state $s'$. In this framework, one looks for a strategy: a precise way of specifying the sequence of actions that induces, on average, an optimal sum of discounted rewards $\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]$. Here, $(r_{0},r_{1},\dots)$ is the infinitely long (random) sequence of rewards induced by the strategy, and $\gamma\in(0,1)$ is a discount factor putting more weight on rewards obtained earlier. Central to the MDP framework is the Bellman equation, which characterizes the optimal value function $V^{*}$:

$\forall s\in S,\quad V^{*}(s)=\max_{a\in A}\sum_{s'\in S}T(s,a,s')\left[r(s,a,s')+\gamma V^{*}(s')\right].$

Once the optimal value function is computed, it is straightforward to derive an optimal strategy, which is deterministic and memoryless, i.e., a simple mapping from states to actions. Such a strategy is usually called a policy. An optimal policy is any policy ${\pi }^{*}$ that is greedy with respect to ${V}^{*}$, i.e., which satisfies:

$\forall s\in S,\quad \pi^{*}(s)\in\operatorname*{arg\,max}_{a\in A}\sum_{s'\in S}T(s,a,s')\left[r(s,a,s')+\gamma V^{*}(s')\right].$
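The two equations above can be turned into a short procedure: apply the Bellman backup repeatedly until the value function stops changing (value iteration), then act greedily. The following Python sketch does this for an enumerative MDP. The two-state example, the function names and the stopping tolerance are assumptions made for illustration, and the reward is taken to be deterministic (or its expectation).

```python
def backup(s, a, V, S, T, r, gamma):
    """One-step lookahead: sum over s' of T(s,a,s') * [r(s,a,s') + gamma * V(s')]."""
    return sum(T(s, a, s2) * (r(s, a, s2) + gamma * V[s2]) for s2 in S)


def value_iteration(S, A, T, r, gamma, tol=1e-10):
    """Iterate the Bellman operator until the largest update falls below `tol`."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(backup(s, a, V, S, T, r, gamma) for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new


def greedy_policy(V, S, A, T, r, gamma):
    """Pick, in each state, an action maximizing the one-step lookahead on V."""
    return {s: max(A, key=lambda a: backup(s, a, V, S, T, r, gamma)) for s in S}


# Hypothetical two-state MDP: "go" switches state with probability 0.8,
# "stay" never moves, and a reward of 1 is received when landing in state 1.
S = [0, 1]
A = ["stay", "go"]
GAMMA = 0.9


def T(s, a, s2):
    p_switch = 0.8 if a == "go" else 0.0
    return p_switch if s2 != s else 1.0 - p_switch


def r(s, a, s2):
    return 1.0 if s2 == 1 else 0.0


V = value_iteration(S, A, T, r, GAMMA)
policy = greedy_policy(V, S, A, T, r, GAMMA)
print(V, policy)
```

For this example the fixed point can be checked by hand: staying in state 1 forever yields $1/(1-\gamma)=10$, and solving $V^{*}(0)=0.8\,(1+0.9\cdot 10)+0.2\cdot 0.9\,V^{*}(0)$ gives $V^{*}(0)=8/0.82\approx 9.76$, with greedy policy "go" in state 0 and "stay" in state 1.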

An important extension of MDPs, known as Partially Observable MDPs (POMDPs), makes it possible to account for the fact that the state may not be fully observable by the decision maker. While the goal is the same as in an MDP (optimizing the expected sum of discounted rewards), the solution is more intricate. Any POMDP can be seen as equivalent to an MDP defined on the space of probability distributions over states, called belief states. The Bellman machinery then applies to the belief states. The specific structure of the resulting MDP makes it possible to iteratively approximate the optimal value function, which is convex in the belief space, by piecewise linear functions, and to deduce an optimal policy that maps belief states to actions. A further extension, known as a DEC-POMDP, considers $n\ge 2$ agents that need to control the state dynamics in a decentralized way without direct communication.
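The belief-state view can be illustrated with the standard Bayesian belief update, $b'(s')\propto O(s',a,o)\sum_{s}T(s,a,s')\,b(s)$. The observation function $O(s',a,o)$, giving the probability of observing $o$ after action $a$ leads to state $s'$, is an assumption of this sketch since it is not defined in the text above. The two-state "listen" example and all names are likewise hypothetical.

```python
def belief_update(b, a, o, S, T, O):
    """Bayes update of belief `b` after taking action `a` and observing `o`."""
    unnormalized = {
        s2: O(s2, a, o) * sum(T(s, a, s2) * b[s] for s in S)
        for s2 in S
    }
    z = sum(unnormalized.values())  # probability of observing `o`
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / z for s2, p in unnormalized.items()}


# Hypothetical sensing problem: a hidden target is on the left or on the right.
# The only action, "listen", does not change the state (a pure sensing action),
# and the observation names the correct side with probability 0.85.
S = ["left", "right"]


def T(s, a, s2):
    return 1.0 if s2 == s else 0.0


def O(s2, a, o):
    return 0.85 if o == s2 else 0.15


b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "left", S, T, O)
b2 = belief_update(b1, "listen", "left", S, T, O)
print(b1, b2)
```

Starting from a uniform belief, one "left" observation gives $b_1(\text{left})=0.85$ and a second one gives $b_2(\text{left})=0.7225/0.745\approx 0.97$. A policy for the POMDP then maps such beliefs, rather than states, to actions.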

The MDP model described above is enumerative, and the complexity of computing the optimal value function is polynomial in the size of that input. However, in examples of practical size, that complexity is still too high, so naïve approaches do not scale. We consider the following situations: (i) when the state space is large, we study approximation techniques from both a theoretical and a practical point of view; (ii) when the model is unknown, we study how to learn an optimal policy from samples (this problem is also known as Reinforcement Learning [55]); (iii) in factored models, where MDP models are a strict generalization of classical planning and are thus at least PSPACE-hard to solve, we consider using search heuristics adapted from classical planning.
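For situation (ii), a minimal example of learning from samples is tabular Q-learning, a classical Reinforcement Learning algorithm used here purely as an illustration, not as a description of our own methods. The learner never reads the transition probabilities; it only calls a simulator. The two-state environment, the uniform exploration scheme, the step size and the number of steps are all assumptions of this sketch.

```python
import random


def q_learning(S, A, step, gamma, alpha=0.05, n_steps=50_000, seed=0):
    """Estimate action values from sampled transitions only (no access to T or r)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in S for a in A}
    s = S[0]
    for _ in range(n_steps):
        a = rng.choice(A)               # uniform exploration (off-policy)
        s2, reward = step(s, a, rng)    # one sample from the unknown model
        target = reward + gamma * max(Q[(s2, b)] for b in A)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # move the estimate toward the target
        s = s2
    return Q


def step(s, a, rng):
    """Hypothetical simulator: "go" switches state with probability 0.8,
    "stay" never moves; reward 1 is received when landing in state 1."""
    s2 = 1 - s if (a == "go" and rng.random() < 0.8) else s
    return s2, (1.0 if s2 == 1 else 0.0)


S = [0, 1]
A = ["stay", "go"]
Q = q_learning(S, A, step, gamma=0.9)
policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
print(policy)
```

With a constant step size the estimates keep fluctuating around the optimal action values, so only the induced greedy policy (here expected to be "go" in state 0 and "stay" in state 1) should be read off, not exact numbers.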

Solving a POMDP is PSPACE-hard even given an enumerative model. In this framework, we mainly look for assumptions that can be exploited to reduce the complexity of the problem at hand, for instance when some actions have no effect on the state dynamics (active sensing). The decentralized version, DEC-POMDP, induces a significant increase in complexity (NEXP-complete). We tackle the exact computation of finite-horizon optimal solutions, which is challenging even for (very) small state spaces, through alternative reformulations of the problem. We also aim to propose advanced heuristics that efficiently address problems with more agents and a longer time horizon.