Section: Scientific Foundations
Stochastic models
Objectives
We develop algorithms for stochastic models applied to machine learning and decision making. On the one hand, we consider standard stochastic models (Markov chains, Hidden Markov Models, Bayesian networks) and study the computational problems that arise, such as inference of hidden variables and parameter learning. On the other hand, we consider parameterized versions of these models, in which the parameter can be seen as the control or decision of an agent. In these models (Markov decision processes, partially observable Markov decision processes, decentralized Markov decision processes, stochastic games), we consider the problems of a) planning and b) reinforcement learning (estimating the parameters and planning), for one agent and for many agents. For all these problems, our aim is to develop efficient algorithmic solutions and to apply them to complex problems.
In the following, we concentrate our presentation on parameterized stochastic models, known as (partially observable) Markov decision processes, as they trivially generalize the non-parameterized models (Markov chains, Hidden Markov Models). We also outline how these models can be extended to multiagent settings.
A general framework
An agent is anything that can be viewed as sensing its environment through sensors and acting upon that environment through actuators. This view makes Markov decision processes (MDPs) a good candidate for formulating agents. This is probably why MDPs have received considerable attention from the artificial intelligence (AI) community in recent years. They have been adopted as a general framework for planning under uncertainty and reinforcement learning.
Formally, a Markov decision process is a four-tuple ⟨S, A, P, r⟩, where:

S is the state space,

A is the action space,

P is the state-transition probability function that models the dynamics of the system. P(s, a, s^{'}) is the probability of transitioning from s to s^{'} given that action a is chosen.

r is the reward function. r(s, a, s^{'}) stands for the reward obtained from taking action a in state s and transitioning to state s^{'}.
With this framework, we can model the interaction between an agent and an environment. The environment can be considered as a Markov decision process which is controlled by an agent. When, in a given state s, an action a is chosen by the agent, the probability for the system to get to state s^{'} is given by P(s, a, s^{'}). After each transition, the environment generates a numerical reward r(s, a, s^{'}). The behaviour of the agent can be represented by a mapping π : S → A between states and actions. Such a mapping is called a policy.
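The definition above can be made concrete with a small sketch. It is only an illustration: the two-state problem, its numbers, and the names `step`, `rollout` and `pi` are hypothetical and do not come from the project's software.

```python
import random

# Hypothetical two-state, two-action MDP; all names and numbers are illustrative.
S = ["s0", "s1"]
A = ["stay", "move"]

# P[(s, a)] maps each successor s' to the probability P(s, a, s').
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

def r(s, a, s_next):
    """Reward r(s, a, s'): 1 for arriving in s1, 0 otherwise (assumed)."""
    return 1.0 if s_next == "s1" else 0.0

def step(s, a, rng):
    """Sample s' from P(s, a, .) and return the pair (s', r(s, a, s'))."""
    successors = list(P[(s, a)])
    weights = [P[(s, a)][t] for t in successors]
    s_next = rng.choices(successors, weights=weights)[0]
    return s_next, r(s, a, s_next)

# A deterministic policy, i.e. a mapping from states to actions.
pi = {"s0": "move", "s1": "stay"}

def rollout(s, policy, horizon, rng):
    """Follow the policy for `horizon` steps and return the rewards received."""
    rewards = []
    for _ in range(horizon):
        s, reward = step(s, policy[s], rng)
        rewards.append(reward)
    return rewards

rewards = rollout("s0", pi, horizon=10, rng=random.Random(0))
```

Storing P as a dictionary keyed by (state, action) keeps the sketch close to the notation P(s, a, s^{'}); larger problems would more plausibly use arrays or a generative simulator.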
In such a framework, we consider the following problems:

Given explicit knowledge of the problem (that is, P and r), find an optimal behaviour, i.e., a policy that maximizes a given performance criterion for the agent. There are three popular performance criteria for evaluating a policy.

Given the ability to interact with the environment (that is, samples of P and r obtained by simulation or real-world interaction), find an optimal behaviour. This amounts to learning what to do in each state of the environment by a trial-and-error process; such a problem is usually called reinforcement learning. It is, as stated by Sutton and Barto [56], an approach for understanding and automating goal-directed learning and decision-making that is quite different from supervised learning. Indeed, it is in most cases impossible to get examples of good behaviors for all situations in which an agent has to act. The trade-off between exploration and exploitation is one of the major issues to address.

Furthermore, a general problem, which is useful for the two previous problems, consists in finding good representations of the environment so that an agent can achieve the above objectives.
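The first two problems can be contrasted on a toy example. The sketch below is hedged: it assumes the discounted cumulative reward as the performance criterion (the text does not fix one here), uses value iteration as one standard planning method and tabular Q-learning with epsilon-greedy exploration as one standard reinforcement learning method, and relies on a hypothetical two-state MDP whose numbers are purely illustrative.

```python
import random

# Hypothetical MDP: P[s][a] is a list of (s', probability) pairs.
STATES = [0, 1]
ACTIONS = [0, 1]
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.1), (1, 0.9)], 1: [(0, 0.8), (1, 0.2)]},
}
GAMMA = 0.9  # discount factor (assumed criterion: discounted cumulative reward)

def r(s, a, s_next):
    """Reward r(s, a, s'): 1 for arriving in state 1, 0 otherwise (assumed)."""
    return 1.0 if s_next == 1 else 0.0

def q_from_v(V, s, a):
    """Expected one-step return of action a in state s, given state values V."""
    return sum(p * (r(s, a, t) + GAMMA * V[t]) for t, p in P[s][a])

def value_iteration(tol=1e-8):
    """Planning: P and r are known; iterate the Bellman optimality backup."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: max(q_from_v(V, s, a) for a in ACTIONS) for s in STATES}
        converged = max(abs(new_V[s] - V[s]) for s in STATES) < tol
        V = new_V
        if converged:
            break
    policy = {s: max(ACTIONS, key=lambda a: q_from_v(V, s, a)) for s in STATES}
    return V, policy

def q_learning(steps=50000, alpha=0.05, epsilon=0.2, seed=0):
    """Learning: only sampled transitions are used, never P itself."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        # Exploration versus exploitation: random action with probability epsilon.
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        # The environment (simulated here) draws the next state and reward.
        successors, probs = zip(*P[s][a])
        s_next = rng.choices(successors, weights=probs)[0]
        target = r(s, a, s_next) + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}

V, planned_policy = value_iteration()
learned_policy = q_learning()
```

On this toy problem value iteration selects action 1 in state 0 and action 0 in state 1. The Q-learning result depends on the seed, step size and exploration rate, so agreement with the planned policy is expected here but not guaranteed in general.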
In a more general setting, an agent may not perceive the state in which it stands. The information that an agent can acquire about the environment is generally restricted to observations, which give only partial information about the state of the system. These observations can be obtained, for example, using sensors that return some estimate of the state of the environment. Thus, the decision process has hidden state, and the issue of finding an optimal policy is no longer a Markov problem. A model that describes such a hidden-state and observation structure is the POMDP (partially observable MDP). Formally, a POMDP is a tuple ⟨S, A, P, r, Ω, O⟩, where:

S, A, P and r are defined as in an MDP.

Ω is a finite set of observations.

O is a table of observation probabilities. O(s, a, s^{'}, o) is the probability of transitioning from s to s^{'} on taking action a in s while observing o. Here s, s^{'} ∈ S, a ∈ A, o ∈ Ω.
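Because the state is hidden, an agent in a POMDP typically maintains a belief, that is, a probability distribution over states, and revises it after each action and observation. The sketch below shows one such Bayesian update. It assumes that O(s, a, s^{'}, o) is read as the probability of observing o given the transition from s to s^{'} under action a, which is one possible reading of the definition above; the two-state listening problem and all names are hypothetical.

```python
def belief_update(b, a, o, states, P, O):
    """Return b'(s') proportional to sum over s of b(s) * P(s, a, s') * O(s, a, s', o)."""
    unnormalized = {
        t: sum(b[s] * P[(s, a, t)] * O[(s, a, t, o)] for s in states)
        for t in states
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {t: w / total for t, w in unnormalized.items()}

# Hypothetical problem: the hidden state does not change when listening,
# and the observation names the true state with probability 0.85.
states = ["left", "right"]
observations = ["hear-left", "hear-right"]
P = {(s, "listen", t): 1.0 if s == t else 0.0 for s in states for t in states}
O = {}
for s in states:
    for t in states:
        for o in observations:
            correct = (o == "hear-" + t)
            O[(s, "listen", t, o)] = 0.85 if correct else 0.15

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear-left", states, P, O)
```

Starting from a uniform belief, a single "hear-left" observation moves the belief in "left" to about 0.85, matching the assumed sensor accuracy.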
Hidden Markov Models are a particular case of POMDPs in which there is no action and no reward. Based on this mathematical framework, several learning algorithms can be used for diagnosis and prognosis tasks. Given a proper description of the state of a system, it is possible to model it as a Markov chain. The dynamics of the system are modeled as transition probabilities between states. The information that an external observer can acquire about the system is modeled by observations, which give only partial information on its state. The problem of diagnosis is then to find the most likely state given a sequence of observations. Prognosis amounts to predicting the future state of the system given a sequence of observations and is thus strongly linked to diagnosis in the case of Hidden Markov Models. Given a proper corpus of diagnosis examples, AI algorithms enable the automated learning of an appropriate Hidden Markov Model that can be used for both diagnosis and prognosis. Rabiner [50] gives an excellent introduction to HMMs and describes the most frequently used algorithms.
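Diagnosis and prognosis as described above can be sketched with the normalized forward pass of an HMM. This is a minimal illustration, not the project's implementation: the machine with states "ok" and "faulty", its probabilities, and the function names are all assumptions, and the prior is taken to describe the state at the time of the first observation.

```python
def forward_filter(prior, T, E, observations):
    """Diagnosis: return P(current state | all observations so far)."""
    states = list(prior)
    belief = dict(prior)
    for i, o in enumerate(observations):
        if i > 0:
            # Propagate the belief through the transition model T[s][s'].
            belief = {t: sum(belief[s] * T[s][t] for s in states) for t in states}
        # Weight by the emission probability E[s][o] and renormalize.
        weighted = {s: belief[s] * E[s][o] for s in states}
        total = sum(weighted.values())
        belief = {s: w / total for s, w in weighted.items()}
    return belief

def predict(belief, T, steps=1):
    """Prognosis: push the current belief `steps` transitions into the future."""
    states = list(belief)
    for _ in range(steps):
        belief = {t: sum(belief[s] * T[s][t] for s in states) for t in states}
    return belief

# Hypothetical model: a fault is permanent once it has occurred.
prior = {"ok": 0.9, "faulty": 0.1}
T = {"ok": {"ok": 0.95, "faulty": 0.05}, "faulty": {"ok": 0.0, "faulty": 1.0}}
E = {"ok": {"normal": 0.9, "alarm": 0.1}, "faulty": {"normal": 0.3, "alarm": 0.7}}

belief = forward_filter(prior, T, E, ["normal", "alarm", "alarm"])
diagnosis = max(belief, key=belief.get)   # most likely current state
forecast = predict(belief, T, steps=5)    # state distribution five steps ahead
```

The filter returns the most likely current state; recovering the most likely sequence of states would instead call for the Viterbi algorithm, and learning T and E from a corpus for an algorithm such as Baum-Welch.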
While substantial progress has been made in planning and control of single agents, a similar formal treatment of multiagent systems is still missing. Some preliminary work has been reported, but it generally avoids the central issue in multiagent systems: agents typically have different information and different knowledge about the overall system and they cannot share all this information all the time. To address the problem of coordination and control of collaborative multiagent systems, we are conducting both analytical and experimental research aimed at understanding the computational complexity of the problem and at developing effective algorithms for solving it. The main objectives of the project are:

To develop a formal foundation for analysis, algorithm development, and evaluation of different approaches to the control of collaborative multiagent systems that explicitly captures the notion of communication cost.

To identify the complexity of the planning and control problem under various constraints on information observability and communication costs.

To gain a better understanding of what makes decentralized planning and control a hard problem and how to simplify it without compromising the efficiency of the model.

To develop new general-purpose algorithms for solving different classes of the decentralized planning and control problem.

To demonstrate the applicability of new techniques to realistic applications and develop evaluation metrics suitable for decentralized planning and control.
In formalizing coordination, we take an approach based on distributed optimization, in part because we feel that this is the richest of such frameworks: it handles coordination problems in which there are multiple and concurrent goals of varying worth, hard and soft deadlines for goal achievement, and alternative ways of achieving goals that offer a trade-off between the quality of the solution and the resources required. Equally important is the fact that this decision-theoretic approach allows us to model explicitly the effects of environmental uncertainty, incomplete and uncertain information, and action outcome uncertainty. Coping with these uncertainties is one of the key challenges in designing sophisticated coordination protocols. Finally, a decision-theoretic framework is the most natural one for quantifying the performance of coordination protocols from a statistical perspective.
Contemporary similar or related work in national and international laboratories
As far as stochastic planning is concerned, models based on Markov decision processes have been increasingly used by the AI research community since the mid-1990s. In association with the ARC INRIA LIRE and with P. Chassaing of the OMEGA project, our research group has contributed to the development of this field of research, notably by co-organizing workshops for the AAAI, IJCAI and ECAI conferences. We also maintain active collaborations with S. Zilberstein (on two NSF-INRIA projects) and with NASA (on a project entitled “Self-directed cooperative planetary rovers”), in association with S. Zilberstein and V. Lesser of the University of Massachusetts, E. Hansen of Mississippi State University, R. Washington, now at Google, and A.I. Mouaddib of GREYC, Caen.
We build on the basic theoretical properties of the two major approaches to learning and planning that we follow in order to design exact algorithms that are able to deal with practical problems of high complexity. Instances of these algorithms include the JLO algorithm for Bayesian networks, and the Q-learning, TD(λ) and Witness algorithms for problems based on the Markov decision process formalism. While the majority of this work has been done in the United States, the French research community is catching up quickly by further developing this domain on its own. MAIA has been directly involved in making substantial contributions to this development, notably through our active participation in the (informally formed) group of French researchers working on MDPs. Today there is a growing number of research labs in France with teams working on MDPs: to name a few, Toulouse-based labs such as IRIT, CERT, INRA and LAAS, the GREYC at Caen, and INRIA Lille - Nord Europe and Paris.
Most of the current work is focused on finding approximate algorithms. Besides applying these algorithms to a multiagent system (MAS) framework, we have also been focusing on reducing the complexity of implementing these algorithms by making use of the meta-knowledge available in the system being modeled. Thus, in implementing the algorithms, we look for temporal, spatial and structural dynamics or functions of the given problem, which saves time when finding approximate solutions. Moreover, we are seeking ways to rigorously combine these two forms of learning, and then to use them for applications involving planning or learning for agents located in an environment.