Section: New Results
Decision Making
Complexity Analysis of Exact Dynamic Programming Algorithms for MDPs
Participant : Bruno Scherrer.
Eugene Feinberg and Jefferson Huang are external collaborators from Stony Brook University.
Following last year's work on the strong polynomiality of Policy Iteration, we show that the number of arithmetic operations required by any member of a broad class of optimistic policy iteration algorithms to solve a deterministic discounted dynamic programming problem with three states and four actions may grow arbitrarily. Therefore, no such algorithm is strongly polynomial. In particular, the modified policy iteration and $\lambda$-policy iteration algorithms are not strongly polynomial. This work was published in Operations Research Letters [4].
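As background for the result above, here is a minimal sketch (ours, with toy data structures) of exact policy iteration on a deterministic discounted problem, i.e., the generic scheme whose optimistic variants are analyzed; the specific 3-state, 4-action construction of the paper is not reproduced here.

```python
import numpy as np

# Minimal exact policy iteration for a deterministic discounted problem.
# next_state[s, a] is the deterministic successor, reward[s, a] the reward.
def policy_iteration(next_state, reward, gamma=0.9):
    n_states = reward.shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        idx = np.arange(n_states)
        r = reward[idx, policy]
        P = np.zeros((n_states, n_states))
        P[idx, next_state[idx, policy]] = 1.0
        v = np.linalg.solve(np.eye(n_states) - gamma * P, r)
        # Greedy policy improvement.
        q = reward + gamma * v[next_state]
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```

Optimistic variants replace the exact linear solve by a limited number of Bellman backups on the current value estimate; the result above shows that the number of arithmetic operations such variants may need on a fixed-size instance can grow arbitrarily.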
Analysis of Approximate Dynamic Programming Algorithms for MDPs
Participants : Bruno Scherrer, Manel Tagorti.
Matthieu Geist is an external collaborator from Supélec.
In [40] , we consider LSTD($\lambda$), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002). It computes a linear approximation of the value function of a fixed policy in a large Markov Decision Process. Under a $\beta$-mixing assumption, we derive, for any value of $\lambda \in (0,1)$, a high-probability estimate of the rate of convergence of this algorithm to its limit. We deduce a high-probability bound on the error of this algorithm that extends (and slightly improves) the one derived by Lazaric et al. (2012) in the specific case where $\lambda = 0$. In particular, our analysis sheds some light on the choice of $\lambda$ with respect to the quality of the chosen linear space and the number of samples, and complies with simulations. This work was presented at the national JFPDA conference [34] .
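For concreteness, here is a minimal batch sketch of LSTD($\lambda$); the function name, the small ridge term, and the trajectory representation are our own illustrative choices, not the paper's notation.

```python
import numpy as np

# Batch LSTD(lambda) sketch: phi is the feature map, trajectory is a list of
# (state, reward, next_state) triples generated by the fixed policy.
def lstd_lambda(trajectory, phi, gamma=0.95, lam=0.5, reg=1e-6):
    d = len(phi(trajectory[0][0]))
    A = reg * np.eye(d)          # small ridge term to ensure invertibility
    b = np.zeros(d)
    z = np.zeros(d)              # eligibility trace
    for s, r, s_next in trajectory:
        z = gamma * lam * z + phi(s)
        A += np.outer(z, phi(s) - gamma * phi(s_next))
        b += z * r
    return np.linalg.solve(A, b)  # weights of the linear value approximation
```

On a deterministic two-state cycle with tabular (one-hot) features, the weights recovered this way coincide with the true values for any $\lambda$, which is the fixed point the convergence-rate analysis above is about.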
In the context of the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes, we focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration (API), Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search by Dynamic Programming algorithm to the infinite-horizon case (PSDP), and the recently proposed Non-Stationary Policy Iteration (NSPI). For all algorithms, we describe performance bounds with respect to the per-iteration error $\epsilon$, and make a comparison paying particular attention to the concentrability constants involved, the number of iterations, and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API, but this comes at the cost of an increase of the number of iterations that is exponential in $\frac{1}{\epsilon}$. 2) PSDP${}_{\infty}$ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but it is obtained within a number of iterations similar to that of API. 3) Contrary to API, which requires a constant amount of memory, the memory needed by CPI and PSDP is proportional to their number of iterations, which may be problematic when the discount factor $\gamma$ is close to 1 or the approximation error $\epsilon$ is close to 0; we show that the NSPI algorithm allows an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis. This work was presented at this year's International Conference on Machine Learning (ICML) [28] .
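The contrast in point 1 comes from the two update rules, which can be sketched schematically with stochastic policies stored as (states, actions) arrays; the function names and representation here are ours, and the Q-values would come from an approximate evaluation step not shown.

```python
import numpy as np

def greedy(q):
    # Deterministic greedy policy with respect to the Q-value estimate.
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

def api_update(pi, q):
    # API: switch entirely to the greedy policy at each iteration.
    return greedy(q)

def cpi_update(pi, q, alpha):
    # CPI: conservative mixture of the current and greedy policies;
    # small alpha yields the better guarantee but many more iterations.
    return (1 - alpha) * pi + alpha * greedy(q)
```

The memory issue in point 3 is also visible in this view: API keeps one array, while CPI must retain (implicitly, through the mixture) every greedy policy generated so far.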
Finally, we consider Local Policy Search, a popular reinforcement learning approach for handling large state spaces. Formally, it searches locally in a parameterized policy space in order to maximize the associated value function averaged over some predefined distribution. The best one can hope for in general from such an approach is a local optimum of this criterion. Our first contribution is the following surprising result: if the policy space is convex, any (approximate) local optimum enjoys a global performance guarantee. Unfortunately, the convexity assumption is strong: it is not satisfied by commonly used parameterizations, and designing a parameterization that induces this property seems hard. A natural way to alleviate this issue is to derive an algorithm that solves the local policy search problem using a boosting approach (constrained to the convex hull of the policy space). The resulting algorithm turns out to be a slight generalization of Conservative Policy Iteration; thus, our second contribution is to highlight an original connection between local policy search and approximate dynamic programming. This work was presented at this year's European Conference on Machine Learning (ECML) [27] .
Adaptive Management with POMDPs
Participants : Olivier Buffet, Jilles Dibangoye.
Samuel Nicol and Iadine Chadès (CSIRO) are external collaborators.
In the field of conservation biology, adaptive management is about managing a system (e.g., performing actions so as to protect some endangered species) while learning how it behaves. This is a typical reinforcement learning task that could, for example, be addressed through Bayesian Reinforcement Learning.
During Samuel Nicol's visit, the main problem we studied is how to manage company inspections so as to deter companies from adopting dangerous behaviors. This was modeled as a particular Stackelberg game, in which $N$ companies benefit from acting badly as long as they are not caught by inspections, and one government agency has to decide which companies to inspect given a limited budget. The expected result is a stochastic strategy (randomly deciding which companies to inspect, with probabilities that depend on the benefits and losses of both types of players). We are working on exploiting particular features of this computationally complex problem to make it more tractable.
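To illustrate why the agency's strategy is stochastic, here is a deliberately simplified toy model (ours, not the Stackelberg computation of the actual work): company $i$ is deterred when its inspection probability $p_i$ satisfies $g_i (1 - p_i) - c_i p_i \le 0$, i.e., $p_i \ge g_i / (g_i + c_i)$, and the agency's budget caps the sum of the $p_i$.

```python
# Toy inspection-budget allocation. gains[i] is company i's gain from acting
# badly when not caught, penalties[i] its loss when caught. A simple (not
# necessarily optimal) heuristic deters the cheapest-to-deter companies first.
def deterrence_thresholds(gains, penalties):
    return [g / (g + c) for g, c in zip(gains, penalties)]

def greedy_inspection_plan(gains, penalties, budget):
    thresholds = deterrence_thresholds(gains, penalties)
    plan = [0.0] * len(thresholds)
    for i in sorted(range(len(thresholds)), key=lambda i: thresholds[i]):
        if thresholds[i] <= budget:
            plan[i] = thresholds[i]   # inspect just often enough to deter i
            budget -= thresholds[i]
    return plan
```

The actual problem is harder than this sketch suggests, since companies best-respond strategically to the announced inspection probabilities; this is precisely the structure we are trying to exploit for tractability.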
Solving decentralized stochastic control problems as continuous-state MDPs
Participants : Jilles Dibangoye, Olivier Buffet, François Charpillet.
External collaborators: Christopher Amato (MIT).
Decentralized partially observable Markov decision processes (Dec-POMDPs) are rich models for cooperative decision-making under uncertainty, but they are often intractable to solve optimally (the problem is NEXP-complete), even using efficient heuristic search algorithms.
State-of-the-art approaches rely on turning a Dec-POMDP into an equivalent deterministic MDP, whose actions at time $t$ correspond to a vector containing one decision rule (i.e., instantaneous policy) per agent, typically solved using a heuristic search algorithm inspired by A*. In recent work (IJCAI'13), we identified a sufficient statistic of this MDP, the occupancy state, i.e., a probability distribution over possible states and joint histories of the agents, and demonstrated that the value function is piecewise-linear and convex with respect to this statistic. This puts us in the same situation as with POMDPs, allowing the value function to be generalized from one occupancy state to another and leading to much faster algorithms (also using efficient compression methods).
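To make the notion of occupancy state concrete, here is a toy update (our own data structures and names, not the paper's) that advances a distribution over (hidden state, joint history) pairs given one decision rule per agent.

```python
# occupancy: dict {(state, histories): prob}, where histories is a tuple of
# per-agent history tuples. decision_rules[i] maps agent i's history to its
# action. transition[s][joint_action] -> {next_state: prob};
# observation[next_state][joint_action] -> {joint_obs: prob}.
def update_occupancy(occupancy, decision_rules, transition, observation):
    new_occ = {}
    for (s, histories), prob in occupancy.items():
        # Each agent acts on its own history only: this is the decentralization.
        joint_action = tuple(rule[h] for rule, h in zip(decision_rules, histories))
        for s2, pt in transition[s][joint_action].items():
            for obs, po in observation[s2][joint_action].items():
                new_h = tuple(h + (a, o)
                              for h, a, o in zip(histories, joint_action, obs))
                key = (s2, new_h)
                new_occ[key] = new_occ.get(key, 0.0) + prob * pt * po
    return new_occ
```

This deterministic map over occupancy states is what makes the reformulation an MDP with continuous states; the piecewise-linear convex value function over this simplex is what enables POMDP-style generalization.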
This year, we have further progressed on this line of research.

A journal paper has been submitted that presents the “occupancy MDP” approach in detail.

In the case of Network-Distributed POMDPs, a particular setting where the relations between agents follow a fixed network topology, we have shown that the value function can be decomposed additively, with one value function per neighborhood. This work was presented at AAMAS'2014 [12] , receiving the conference's best paper award.

To further scale up the resolution of Dec-POMDPs, we have proposed multiple approximation techniques that can be combined and that allow error bounds to be controlled. This work was presented at ECML'2014 [13] .
Learning Bad Actions
Participant : Olivier Buffet.
Jörg Hoffmann, former member of MAIA, Michal Krajňanský (Saarland University), and Alan Fern (Oregon State University) are external collaborators.
In classical planning, a key problem is to exploit heuristic knowledge to efficiently guide the search for a sequence of actions leading to a goal state.
In some settings, one may have the opportunity to solve multiple small instances of a problem before solving larger instances, e.g., handling a logistics problem with small numbers of trucks, depots and items before moving to (much) larger numbers. The small instances may then yield knowledge that can be reused when facing larger instances. Previous work shows that it is difficult to directly learn rules specifying which action to pick in a given situation. Instead, we look for rules telling which actions should not be considered, so as to reduce the search space. This approach requires answering several questions: What are examples of bad (or non-bad) actions? How can they be obtained? Which learning algorithm should be used?
A first algorithm (with variants) has been proposed that learns rules for detecting (supposedly) bad actions. It has been empirically evaluated, providing encouraging results, but also showing that different variants perform best in different settings. This algorithm was presented at ECAI'2014 [24] and participated in the learning track of the 2014 International Planning Competition (http://ipc.icaps-conference.org/ ).
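A deliberately simple sketch of the underlying idea (ours, with toy representations): label an action as bad in a state if it was applicable there during training but never chosen by any known solution plan, then prune such actions at search time. The real algorithm learns rules that generalize across instances; this version only memorizes ground pairs.

```python
# training_runs: list of (applicable, plan) pairs, where `applicable` is the
# set of (state, action) pairs encountered during search on a small instance
# and `plan` is the set of (state, action) choices of the solution found.
def label_bad_actions(training_runs):
    chosen, applicable_all = set(), set()
    for applicable, plan in training_runs:
        applicable_all |= applicable
        chosen |= plan
    # Bad = applicable somewhere, but never used by any solution.
    return applicable_all - chosen

def prune(state, actions, bad):
    # Search-time filter; keep the original set if everything would be pruned.
    kept = [a for a in actions if (state, a) not in bad]
    return kept or actions
```

Since the labeling is only heuristic (an action unused in training plans may still be needed elsewhere), pruning trades completeness for search-space reduction, which is consistent with the mixed empirical picture reported above.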