Section: New Results
Stochastic models
The keyword for our recent work on stochastic models is “distributed”. In terms of decentralized control, we have developed exact and approximate methods for the Decentralized Partially Observable Markov Decision Process (DEC-POMDP) framework and investigated the use of game-theory-inspired concepts for learning to coordinate. We have also unveiled strong links between optimal and harmonic control and discussed some implications of these links for the distributed computation of optimal trajectories.
Decentralized Stochastic Models
There is a wide range of application domains in which decision-making must be performed by a number of distributed agents that try to achieve a common goal. This includes information-gathering agents, distributed sensing, coordination of multiple distributed robots, decentralized control of a power grid, autonomous space exploration systems, network traffic routing, decentralized supply chains, as well as the operation of complex human organizations. These domains require the development of a strategy for each decision maker, assuming that decision makers will have limited ability to communicate when they execute their strategies, and will therefore have different knowledge about the global situation.
Our research team focuses on the development of a decision-theoretic framework for such collaborative multiagent systems. The overall goal is to develop sophisticated coordination strategies that stand on a formal footing. This has enabled us to better understand the strengths and limitations of existing heuristic approaches to coordination and, more importantly, to develop new approaches based on these more formal underpinnings. One important result is that the theory of Markov Decision Processes (MDPs) proves particularly powerful in this context. In particular, we are extending the MDP framework to problems of decentralized control.
Relying on concepts from decision theory and game theory, we have proposed several algorithms for decentralized stochastic models. These new results concern both planning and learning. This work has been partly supported by the INRIA associated team Umass with S. Zilberstein.
Building multiagent systems through the use of interactions and RL techniques
Participant : Vincent Thomas.
Mahuna Akplogan participated during his internship.
The DEC-POMDP model, proposed by S. Zilberstein in 2000, was one of the first models to formally describe distributed decision problems, but subsequent work has shown that building optimal policies for the agents in this context is intractable in practice (NEXP complexity). Our work is based on the observation that the interactions among agents, which can structure the problem, are not explicitly represented. We assume that this may be one of the reasons why solving DEC-POMDPs is difficult, and that representing interactions can open new perspectives in collective reinforcement learning.
Guided by these hypotheses, we have previously proposed (1) an original formalism, the Interac-DEC-POMDP, in which interactions are explicitly represented so that agents can reason about the use of interactions and their relationships with others, and (2) a new general-purpose decentralized learning algorithm in which agents build their policies based on a heuristic distribution of rewards among agents during interactions. However, it was difficult to compare Interac-DEC-POMDPs with DEC-POMDPs due to the particular structure of Interac-DEC-POMDPs.
We are currently pursuing a similar approach through the concept of social actions, which represents actions and interactions in a uniform manner. This stays closer to the DEC-POMDP formalism while still allowing the agents to learn and to reason about their interactions with the other agents of the system [29].
Investigation in Game Theory inspired Decentralized Reinforcement Learning
Participants : François Charpillet, Alain Dutech.
Raghav Aras is a former PhD student of MAIA and now an external collaborator.
Studying Decentralized Reinforcement Learning from the point of view of Game Theory, so as to allow multiagent systems to learn to coordinate, led us to formulate a new approach for solving DEC-POMDPs. This new formulation can also be applied to POMDPs.
More specifically, we address the problem of finding exact solutions for finite-horizon decentralized decision processes for n agents where n is greater than two. Our new approach is based on two ideas:

we represent each agent's policy in the sequence form and not in the tree form, thereby obtaining a very compact representation of the set of joint policies.

using this compact representation, we solve this problem as an instance of combinatorial optimization for which we formulate a mixed integer linear program (MILP).
Our new algorithm has been experimentally validated on several classical problems often used in the DecPOMDP community.
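To give a feel for why the sequence form is more compact than the tree form, the following sketch counts, for a single agent, the deterministic policy trees of a given horizon and the action-observation sequences of length up to that horizon. The function names and the example sizes (2 actions, 2 observations) are illustrative assumptions and are not taken from the publication.

```python
def num_policy_trees(n_actions: int, n_obs: int, horizon: int) -> int:
    """Number of deterministic policy trees of depth `horizon` for one agent.

    A tree has 1 + |O| + ... + |O|^(T-1) decision nodes, and each node
    independently picks one of the |A| actions.
    """
    n_nodes = sum(n_obs ** t for t in range(horizon))
    return n_actions ** n_nodes


def num_sequences(n_actions: int, n_obs: int, horizon: int) -> int:
    """Number of sequences a1 o1 a2 ... at with 1 <= t <= horizon."""
    return sum(n_actions ** t * n_obs ** (t - 1) for t in range(1, horizon + 1))


for T in (2, 3, 4):
    print(T, num_policy_trees(2, 2, T), num_sequences(2, 2, T))
```

In the sequence form, a policy is described by weights attached to these sequences, so the number of variables per agent grows with the number of sequences (exponential in the horizon) rather than with the number of trees (doubly exponential in the horizon).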
The impact of our new approach is still to be evaluated. Although our algorithm is faster than other exact algorithms, the improvement is not very large. An open question is whether our new approach can inspire new algorithms, either for infinite-horizon problems or for finding approximate solutions to finite-horizon problems. This work resulted this year in a major publication [3].
Approximate Point-Based Dynamic Programming for DEC-POMDPs
Participants : François Charpillet, Gabriel Corona.
Planning in the DEC-POMDP framework has been shown to be very difficult: it is NEXP-complete for a finite horizon. The exact dynamic programming construction of policy trees is generally exponential in the planning horizon and the number of agents, and doubly exponential in the number of observations.
Until recently, only problems with very small horizons could be solved. Recent approximate point-based memory-bounded approaches have been able to find plans for longer horizons: MBDP (Memory Bounded Dynamic Programming), IMBDP (Improved MBDP), MBDP-OC (MBDP with Observation Compression), PBIP (Point Based Incremental Pruning). They bound the number of policy trees to a fixed value, maxTrees. A set of maxTrees points is generated as approximations of the prior probability distribution on states, and for each point the best policy tree is kept for each agent.
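The selection step described above can be sketched as follows for two agents. This is a simplified illustration, not the MBDP implementation: the data layout (`joint_values[i][j][s]`), the function name and the toy numbers are assumptions, and the candidate joint values are taken as given instead of being computed by dynamic programming.

```python
from itertools import product


def select_trees(joint_values, beliefs):
    """For each belief point, keep the trees of the best joint policy.

    joint_values[i][j][s]: value in state s when agent 1 follows tree i
    and agent 2 follows tree j (hypothetical, precomputed).
    beliefs: one probability distribution over states per point.
    Returns the sets of tree indices kept for agent 1 and for agent 2.
    """
    kept1, kept2 = set(), set()
    n1, n2 = len(joint_values), len(joint_values[0])
    for b in beliefs:
        def expected(pair):
            i, j = pair
            return sum(p * v for p, v in zip(b, joint_values[i][j]))

        i, j = max(product(range(n1), range(n2)), key=expected)
        kept1.add(i)
        kept2.add(j)
    return kept1, kept2


# Toy problem: 2 states, 3 candidate trees per agent, 2 belief points.
joint_values = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]
joint_values[0][0] = [10.0, 0.0]   # good in state 0
joint_values[2][2] = [0.0, 8.0]    # good in state 1
beliefs = [[0.9, 0.1], [0.2, 0.8]]
print(select_trees(joint_values, beliefs))
```

With one belief point per kept tree, the number of trees retained per agent never exceeds the number of points, which is how the memory bound is enforced in this sketch.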
We are investigating whether probabilistic heuristic information (currently in the form of a prior probability distribution over beliefs) can be generated and used to compute solutions of better quality. Based on this idea, we proposed this year [14] a new approach to point-based memory-bounded dynamic programming that uses this information and that we expect to yield solutions of better quality: the problem of choosing the policy trees is formulated as a combinatorial optimisation problem whose objective is to maximise the expectation (given the heuristic distribution) of the sum of subsequent rewards.
In practice, the results depend heavily on the heuristic chosen. Overall, the approach is able to find very good solutions, often much better than those of MBDP. The computation time is lower than that of MBDP, often by one or two orders of magnitude.
In the future, we expect to adapt the approach in order to scale better with the number of observations: this would enable us to compare with algorithms such as IMBDP, MBDP-OC and PBIP on problems with a higher number of observations. A longer-term goal is to see whether the approach may be used to scale better with the number of agents.
Utility Functions vs Preference Relations in Recommenders
Participant : Olivier Buffet.
Anne Boyer, Armelle Brun and Ahmad Hamad (Kiwi team, Loria) are external collaborators.
Classical approaches in recommender systems need ratings (i.e. utilities) to represent user preferences and deduce unknown preferences of other users. In this work, we focused on an original way to represent preferences, in the form of preference relations. This approach can be viewed as a qualitative representation of preferences, as opposed to the usual quantitative representation. The only information known is whether a user prefers one item over another; “how much” this item is preferred is not known. This approach has the advantage of not requiring users to rate items on a small scale of integer values, which may be a difficult task and yields ratings that may depend heavily on the user, their mood, the items they rated previously, etc. We have proposed to adapt classical measures that exploit utilities, such as the similarity measure between users, so that they exploit preference relations instead. First experiments have been conducted on a well-known data set of user utilities, which we transformed into preference relations. First results show that this approach leads to performance comparable to that of the classical approach. This work has been accepted for publication in the proceedings of the French conference RFIA 2010 [30].
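As an illustration of the idea, the sketch below turns ratings into pairwise preference relations and compares two users by counting the item pairs on which they agree or disagree. The similarity formula (agreements minus disagreements, divided by the number of commonly ordered pairs) is an assumption chosen for simplicity; the measure actually used in [30] is not described here, and all names are hypothetical.

```python
from itertools import combinations


def ratings_to_preferences(ratings):
    """Turn {item: rating} into a set of pairs (a, b) meaning 'a is preferred to b'.

    Items with equal ratings produce no pair: the 'how much' is discarded.
    """
    prefs = set()
    for a, b in combinations(sorted(ratings), 2):
        if ratings[a] > ratings[b]:
            prefs.add((a, b))
        elif ratings[b] > ratings[a]:
            prefs.add((b, a))
    return prefs


def preference_similarity(prefs_u, prefs_v):
    """Similarity in [-1, 1] over the item pairs that both users have ordered."""
    agree = len(prefs_u & prefs_v)
    disagree = sum(1 for (a, b) in prefs_u if (b, a) in prefs_v)
    compared = agree + disagree
    return 0.0 if compared == 0 else (agree - disagree) / compared


alice = ratings_to_preferences({"A": 5, "B": 3, "C": 1})
bob = ratings_to_preferences({"A": 4, "B": 2, "C": 3})
print(preference_similarity(alice, bob))
```

Here the two users agree on two pairs (A over B, A over C) and disagree on one (B versus C), regardless of the rating scale each of them used.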
Active Sensing
Participants : Mauricio Araya, Vincent Thomas, Olivier Buffet, François Charpillet.
A large class of sequential decision making problems — namely active sensing — is concerned with acting so as to maximize the acquired information. Such problems can be cast as a special form of Partially Observable Markov Decision Processes where 1) the reward signal is linked to the information gathered and 2) there may be no actions with effects on the state of the system. These problems imply reasoning about belief states, and therefore involve dealing with large (if not continuous) state spaces.
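A minimal sketch of such an information-based reward, assuming a static hidden state (no action affects it) and measuring information by the reduction in Shannon entropy of the belief. The entropy criterion and all names are illustrative assumptions; other information measures could be used instead.

```python
import math


def belief_update(belief, likelihood):
    """Bayes rule for a static hidden state: b'(s) is proportional to P(o | s) * b(s)."""
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief."""
    return -sum(p * math.log(p) for p in belief if p > 0.0)


def information_reward(belief, likelihood):
    """Reward of a sensing action: how much the observation reduced uncertainty."""
    return entropy(belief) - entropy(belief_update(belief, likelihood))


prior = [0.5, 0.5]
likelihood = [0.9, 0.1]  # hypothetical P(o | s) for the observation received
print(belief_update(prior, likelihood), information_reward(prior, likelihood))
```

Because the reward depends on the belief and not on the hidden state alone, planning has to be carried out in belief space, which is the source of the large state spaces mentioned above.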
Preliminary experiments have been conducted on a “hide and seek” problem where a predator wants to locate a prey in a complex environment. Using a probabilistic occupancy map, we have investigated two interesting heuristic approaches:

to keep on moving towards the point of highest occupancy probability; and

to search for the sequence of moves that is likely to maximize the acquired information (under some simplifying assumptions).
More recently, an in-depth study of existing work led us to a typology of active sensing problems and approaches. Our objective is to better understand this problem class and possibly to identify some specific structures that could be exploited.
This research will be further developed in the context of the COMAC project (Section 8.1.5.3), concerned with the low-cost identification of defects in aeronautics parts made of composite materials by selecting the observations to perform (which sensor, where, at which resolution), with a possible extension to multiple collaborative active sensing agents.
Addressing large optimal control problems
Temporal Difference Based Policy Iteration
Participants : Bruno Scherrer, Christophe Thiery.
We have deepened our understanding and analysis of the algorithm λ Policy Iteration (or Temporal Difference Based Policy Iteration), which generalizes Value Iteration and Policy Iteration by introducing a parameter λ ∈ (0, 1) that allows one to vary continuously from one algorithm to the other. In [35], we have proposed a modified version of this algorithm, which is analogous to the well-known modified version of Policy Iteration. We have proved that it converges to the optimal solution. Using analytical and empirical arguments, we have underlined the fact that values of λ smaller than 1 are not interesting when computations are made exactly. We expect that this parameter will be useful in an approximate setting, and this is currently under investigation.
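The following sketch shows the basic (unmodified) scheme, usually written λ Policy Iteration, on a tiny tabular MDP. It is an illustration in exact arithmetic, not the modified algorithm of [35]. At each step the greedy policy π is computed for the current value v_k, and v_{k+1} is taken as the fixed point of w ↦ (1 − λ) T_π v_k + λ T_π w, so that λ = 0 gives Value Iteration and λ = 1 gives Policy Iteration. The data layout (`P[s][a][t]`, `R[s][a]`) and the toy MDP are assumptions.

```python
def greedy_policy(P, R, gamma, v):
    """Greedy action in each state with respect to the value estimate v."""
    n_states, n_actions = len(R), len(R[0])

    def q(s, a):
        return R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_states))

    return [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]


def bellman_pi(P, R, gamma, pi, v):
    """Apply the Bellman operator T_pi of the fixed policy pi to v."""
    n = len(v)
    return [R[s][pi[s]] + gamma * sum(P[s][pi[s]][t] * v[t] for t in range(n))
            for s in range(n)]


def lambda_policy_iteration(P, R, gamma, lam, n_iter=200, tol=1e-12):
    v = [0.0] * len(R)
    for _ in range(n_iter):
        pi = greedy_policy(P, R, gamma, v)
        base = bellman_pi(P, R, gamma, pi, v)  # T_pi v_k
        w = list(v)
        # Iterate w -> (1 - lam) * T_pi v_k + lam * T_pi w, a contraction of
        # modulus lam * gamma, until it reaches its fixed point v_{k+1}.
        while True:
            tw = bellman_pi(P, R, gamma, pi, w)
            new_w = [(1 - lam) * b + lam * t for b, t in zip(base, tw)]
            done = max(abs(x - y) for x, y in zip(new_w, w)) < tol
            w = new_w
            if done:
                break
        v = w
    return v, greedy_policy(P, R, gamma, v)


# Toy MDP: 2 states, 2 actions, deterministic transitions.
P = [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
     [[0.0, 1.0], [1.0, 0.0]]]   # state 1: action 0 stays, action 1 moves to state 0
R = [[0.0, 1.0], [2.0, 0.0]]
for lam in (0.0, 0.5, 1.0):
    print(lam, lambda_policy_iteration(P, R, 0.9, lam))
```

On this example all three settings reach the same optimal values (19 in state 0 and 20 in state 1); they differ only in how much work is spent on evaluating each intermediate policy.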
Building controllers for the game of Tetris
Participants : Bruno Scherrer, Christophe Thiery.
The game of Tetris is a very large (and therefore challenging) optimal control problem. In [16], [13], we consider the problem of designing a controller for this game. We use the cross-entropy method to tune a rating-based one-piece controller based on several sets of features, some of which are original. This approach leads to a controller that outperforms the previously known results. On the original game of Tetris, we show that with probability 0.95 it achieves at least lines per game on average. On a simplified version of Tetris considered by most research works, it achieves lines per game on average.
From a more general perspective, we wrote a review article on this game [12]. We provide a list of all the features we could find in the literature and in implementations, and mention the methods that have been used for weight optimization. We also highlight the fact that performance measures for Tetris must be compared with great care, as they have a very large variance, and as subtle implementation choices can have a significant effect on the resulting scores. An immediate benefit of this review is illustrated: simply by gathering ideas from different works, we show how we built a controller that outperforms the previously known best controllers, and briefly discuss how it allowed us to win the Tetris domain of the 2008 Reinforcement Learning Competition.
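The two ingredients can be sketched as follows: a rating-based one-piece controller picks the placement whose weighted feature sum is highest, and the cross-entropy method searches for the weights. This is a generic Gaussian cross-entropy loop under stated assumptions, not the tuned procedure of [16], [13]. Since no Tetris simulator is included, the score function is a toy stand-in for the average number of lines cleared, and all names and parameter values are hypothetical.

```python
import random


def best_move(weights, candidates):
    """Index of the placement whose weighted feature sum (its rating) is highest.

    candidates: one feature vector per possible placement of the current piece.
    """
    def rating(features):
        return sum(w * f for w, f in zip(weights, features))

    return max(range(len(candidates)), key=lambda i: rating(candidates[i]))


def cross_entropy_method(score, dim, n_samples=100, n_elite=20, n_iter=30, seed=0):
    """Gaussian cross-entropy search for a weight vector maximizing `score`."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [5.0] * dim
    for _ in range(n_iter):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Keep the best-scoring samples and refit the Gaussian on them.
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        mean = [sum(w[d] for w in elite) / n_elite for d in range(dim)]
        std = [(sum((w[d] - mean[d]) ** 2 for w in elite) / n_elite) ** 0.5
               for d in range(dim)]
    return mean


# Toy stand-in for "play some games with these weights and average the score".
TARGET = [1.0, -2.0, 0.5]


def toy_score(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))


tuned = cross_entropy_method(toy_score, dim=3)
print(tuned)
```

In a real setting, `score` would be noisy because game outcomes have a very large variance, so each evaluation would need to average over many games.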
Multiprocessor Real-time Scheduling under Uncertainty
Participant : Olivier Buffet.
L. Cucu-Grosjean (TRIO team, LORIA) is an external collaborator.
Many embedded systems (e.g. in cars or planes) have to handle repetitive tasks (with different periods) using several processors. However, existing work focuses on distributing jobs over the processors under the assumption that their execution time is fixed, which requires considering the worst-case execution time.
Up to now, we have considered the problem of scheduling jobs over processors in this worst-case deterministic scenario. We have formalized this problem as a constraint satisfaction problem (CSP) and studied various exact and heuristic resolution algorithms [21].
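To illustrate what a CSP view of the deterministic scenario can look like, here is a strongly simplified sketch: each job has a release time, a deadline and a worst-case execution time, jobs are non-preemptive, time is discrete, and the variables are the (processor, start time) pairs. These modelling choices and the plain backtracking search are assumptions made for the example; they are not the formulation or the algorithms studied in [21].

```python
def schedule(jobs, n_procs):
    """Assign each job a (processor, start) pair by backtracking search.

    jobs: list of (release, deadline, wcet) tuples with integer times.
    Returns a list of (processor, start), one per job, or None if infeasible.
    """
    assignment = []

    def conflicts(proc, start, wcet):
        # Two jobs conflict if they share a processor and their intervals overlap.
        for (p, s), (_, _, c) in zip(assignment, jobs):
            if p == proc and start < s + c and s < start + wcet:
                return True
        return False

    def backtrack(k):
        if k == len(jobs):
            return True
        release, deadline, wcet = jobs[k]
        for proc in range(n_procs):
            # Domain of the start variable: the job must fit in its window.
            for start in range(release, deadline - wcet + 1):
                if not conflicts(proc, start, wcet):
                    assignment.append((proc, start))
                    if backtrack(k + 1):
                        return True
                    assignment.pop()
        return False

    return list(assignment) if backtrack(0) else None


jobs = [(0, 2, 2), (0, 2, 2), (2, 4, 2)]
print(schedule(jobs, n_procs=2))
print(schedule(jobs, n_procs=1))
```

With two processors the three jobs fit; with a single processor the first two jobs compete for the same time window and the search reports infeasibility.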
Our objective is to turn to the uncertain scenario where probability distributions over task durations are known. This will require modelling the problem as an MDP and looking for the most appropriate resolution techniques.
Reinforcement Learning in Robotics
Participant : Alain Dutech.
Nicolas Beaufort and Jérôme Bechu participated during their internships.
Applying Reinforcement Learning on a robot is a difficult task because of the following limitations:

Learning must deal with continuous state and action spaces.

Learning must be able to take advantage of very few experiences, as the cost of acquiring new experiences can be high, especially in terms of time.
Nevertheless, by taking inspiration from the work of W. Smart [55], we investigated the notion of efficient reinforcement learning. We designed a simple artificial experiment where a WifiBot has to detect and move to a given target before stopping in front of it. Provided the robot can detect and identify the target with its camera, showing the robot a path to the target 10 or 20 times is enough for it to learn to do it by itself. This was achieved by combining Peng's eligibility traces [49] with a locally weighted approximation of the Q-value of the MDP underlying the behavior of the robot.
Using the newest KheperaIII robot, we also worked on an indirect reinforcement learning algorithm. The goal of the robot is to learn a model of its environment, using an approximation of the transition probabilities associated with its various actions, with only minimal external guidance from a human operator. The model is learned on a continuous state space, reusing the locally weighted approximation algorithm previously developed. One of the strong points of our approach is a tight coupling between the behavior induced by reinforcement learning and more basic behaviors (like obstacle avoidance) in a kind of subsumption architecture that allows the agent to efficiently navigate to its goal despite very crude and uncertain perceptions.
A side effect of this work is the creation of two low/mid-level libraries for controlling WifiBots (TM) and KheperaIII robots. These libraries allow interaction with the robot either at the hardware level (setting motor speed, reading sensors) or at a slightly more abstract level (advance, turn, stop). This software is available on the INRIA gforge at http://gforge.inria.fr/projects/wifibotlib .
Application of Stochastic Methods to Intelligent Transportation Systems
Participants : Maan Badaoui, Cindy Cappelle, Cherif Smaili, François Charpillet.
In order to obtain an autonomous navigation system for intelligent transportation systems, the embedded systems of vehicles have to complete several tasks: localization, obstacle detection, trajectory planning and tracking, lane departure detection, etc. In our work, we study approaches based on stochastic methods for multi-sensor data fusion, taking into account in particular the quality and integrity of the fusion results. Because the application is safety-critical, we pay special attention to estimating the confidence that can be placed in the correctness of the estimate supplied by the whole system. We consider that managing multiple hypotheses can be a useful strategy for handling ambiguous situations induced by sensor uncertainty or failure. Multi-sensor fusion and multi-modal estimation are realized using a hybrid Bayesian network (HBN). Multi-modal estimation is a way to manage multiple hypotheses for the localisation task, in order to account for the imprecision or failure of a sensor or information source [cite:SMAILI:2008:INRIA00339350:1]. Several problems must be tackled in order to develop fault-tolerant data fusion approaches for safe vehicle localisation, such as the convergence robustness and divergence detection of multi-sensor fusion methods in the presence of sensor measurement errors.
Our work also studies the best way to use new information sources, such as a geographical 3D model managed in real time by a 3D Geographical Information System (3D-GIS), to improve an autonomous navigation system. Over the last two years, we have developed approaches for mono- and multi-vehicle localisation, map matching and obstacle detection. Experimental results with real data are used to validate the developed approaches, and demonstrators are under development. This work is performed in the context of the FD2S project of the GIS3SGS and the CRISTAL project [24], [5].
Automated Planning
Classical automated planning differs from Markov Decision Processes in that 1) transitions are deterministic and 2) the system is modelled in a structured — hence compact — manner using state variables. The present section presents work related both to classical (deterministic) planning and probabilistic planning, where problems involve both a structured representation and uncertainties in the system's dynamics.
The Factored Policy-Gradient (FPG) Planner
Participants : Olivier Buffet, Joerg Hoffmann.
Douglas Aberdeen (Google Zürich) is an external collaborator.
FPG addresses probabilistic planning. A key issue in such planning is to exploit the problem's structure to make large instances solvable. Most approaches are based on algorithms which explore the state space, at least partially, so that their complexity is usually linked to the number of states. Our approach, a joint work with Douglas Aberdeen, is very different in that it explores a space of parameterized controllers. By choosing factored controllers (one sub-controller per action) with state variables as inputs, we strongly reduce the complexity of the problem.
In practice, the Factored Policy-Gradient (FPG) planner uses a policy-gradient reinforcement learning algorithm (coupled to a simulator) to optimize a controller based on a linear network (or a multi-layer perceptron). Although suboptimal, FPG proved to be very efficient, winning the probabilistic track of the 2006 International Planning Competition. One of its strengths lies in generalization: an action known to be good in certain states will be preferred in similar states. This novel approach is presented in full detail in [4].
Recent work includes comparing FPG with a probabilistic planner (named FQL) based on a Q-learning algorithm. Although both algorithms use similar function approximators, FQL fails to provide good policies. Current research looks at using more appropriate policy-search algorithms based on population algorithms or actor-critic algorithms.
We are also developing a new method for reward shaping, where the problem's reward function is modified so as to encourage progress towards goal states. The reward shaping is non-intrusive in that it does not change the optimal solution to the problem; it is essential to success in problems with large search spaces, where reaching the goal by a pure random walk is very unlikely. The shaping is based on progress estimates which are derived from the structure of the problem. In a preprocess, our technique automatically detects landmarks (variable values that every successful path must traverse at some point) as well as pairwise constraints on the order in which that will happen. The progress estimator is based on reasoning about how many landmarks still need to be achieved. A paper is in preparation for ECAI 2010.
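One standard way to obtain a shaping that leaves the optimal solution unchanged is potential-based shaping, where the term γΦ(s') − Φ(s) is added to the reward. The sketch below uses the negated number of landmarks still to be achieved as the potential Φ. Whether the method under development takes exactly this form is not stated here, so both the formula and the names are assumptions made for illustration.

```python
def shaped_reward(reward, remaining_before, remaining_after, gamma=1.0, scale=1.0):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).

    Phi(s) = -scale * (number of landmarks not yet achieved in s), so that
    achieving a landmark yields an immediate positive bonus.
    """
    phi_before = -scale * remaining_before
    phi_after = -scale * remaining_after
    return reward + gamma * phi_after - phi_before


# A path that achieves 3 landmarks one by one, with a goal reward only at the end.
remaining = [3, 3, 2, 1, 0]
rewards = [0.0, 0.0, 0.0, 10.0]
shaped = [shaped_reward(r, before, after)
          for r, before, after in zip(rewards, remaining, remaining[1:])]
print(shaped)
```

With γ = 1, the bonuses along any path sum to Φ(end) − Φ(start), which depends only on the endpoints; this telescoping property is the reason why potential-based shaping does not change which policies are optimal.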
Composition of Business Processes at SAP
Participant : Joerg Hoffmann.
Ingo Weber (University of New South Wales, Australia) and Frank Michael Kraft (SAP, Germany) are external collaborators.
The behavior of certain software artefacts can be naturally expressed, at an appropriate level of abstraction, in terms of their effect on state variable values. At SAP, a set of 2700 system transactions, which underlie the execution of business processes, have been modeled in this way (the uncertainty lies in the fact that many transactions may have different outcomes depending on details that are abstracted away at the level of the model). Our work leverages this model by providing a formalization in a planning language, and an adaptation of an existing planning tool. The resulting technology fully automatically composes useful business process fragments, requiring as input only the “goal”, i.e., a specification of which variables should assume which values. This corresponds well to the background and language of the targeted group of end users (managers); in a prototype developed at SAP, the goal specification is given using simple drop-down menus. The technical description of the planning aspects of the work is published in an ICAPS'09 workshop [48]; a full paper is in preparation for AAAI 2010.
Cellular Automata as a Planning Benchmark
Participants : Joerg Hoffmann, Nazim Fatès.
Hector Palacios (Universidad Simon Bolivar, Caracas, Venezuela) is an external collaborator.
An interesting question for several types of cellular automata is which behaviors lead to a stable system state where no more changes can be made. This corresponds to the problem of planning from an initial system state to a stable state. Adding uncertainty about the initial state, the task for the planner is to find a general strategy that leads to a stable state from many (in the extreme case, from all possible) start states. We have formulated this problem in a planning language, and are investigating under which conditions, and to what extent of generality, existing planning tools can solve it. A paper is in preparation for ICAPS 2010.
SAT Performance and Abstraction in Classical Planning
Participant : Joerg Hoffmann.
Carmel Domshlak (Technion Haifa, Israel) and Ashish Sabharwal (Cornell University, USA) are external collaborators.
Planning as SAT is one of the most effective known approaches for finding plans with an optimality guarantee. The bottleneck lies in completing the optimality proof, which entails proving that no shorter plan exists. Similar refutation tasks have been addressed very successfully in Verification, by considering abstractions (over-approximations) of the system at hand. We applied this methodology to Classical Planning, and found that, somewhat surprisingly, hardly any empirical benefit can be gained. Towards explaining this, we have conducted a theoretical analysis, revealing that, in many of the considered SAT encodings of planning, abstraction cannot improve the best-case behavior of resolution. This finding may be relevant as well for other areas (like Verification) where both abstraction and SAT solving have been successful. The results are presented in [9].