
Section: New Results

Decision Making

Complexity Analysis of Exact Dynamic Programming Algorithms for MDPs

Participant : Bruno Scherrer.

Eugene Feinberg and Jefferson Huang are external collaborators from Stony Brook University.

Following last year's work on the strong polynomiality of Policy Iteration, we show that the number of arithmetic operations required by any member of a broad class of optimistic policy iteration algorithms to solve a deterministic discounted dynamic programming problem with three states and four actions may grow arbitrarily. Therefore, no such algorithm is strongly polynomial; in particular, the modified policy iteration and λ-policy iteration algorithms are not strongly polynomial. This work was published in Operations Research Letters [4].
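For illustration, the algorithm family in question can be sketched as follows. This is a generic textbook-style implementation of modified policy iteration (the parameter m interpolates between value iteration and full policy iteration), not the worst-case instance constructed in the paper; the tensor layout and function name are our own.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, m, iters=100):
    """Modified policy iteration for a finite discounted MDP.

    P: transition tensor of shape (A, S, S); R: rewards of shape (A, S).
    m: number of partial evaluation backups per iteration
       (m = 0 recovers value iteration; m -> infinity, policy iteration).
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        Q = R + gamma * np.einsum('ast,t->as', P, V)   # Q(a, s)
        pi = Q.argmax(axis=0)                          # greedy policy
        # m partial evaluation backups under the fixed greedy policy
        for _ in range(m):
            V = R[pi, np.arange(S)] + gamma * np.einsum(
                'st,t->s', P[pi, np.arange(S)], V)
    return V, pi
```

The paper's point is precisely that the number of iterations of such schemes, unlike those of (exact) policy iteration, cannot be bounded by a polynomial in the instance size alone.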

Analysis of Approximate Dynamic Programming Algorithms for MDPs

Participants : Bruno Scherrer, Manel Tagorti.

Matthieu Geist is an external collaborator from Supélec.

In [40], we consider LSTD(λ), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002). It computes a linear approximation of the value function of a fixed policy in a large Markov Decision Process. Under a β-mixing assumption, we derive, for any value of λ ∈ (0,1), a high-probability estimate of the rate of convergence of this algorithm to its limit. We deduce a high-probability bound on the error of this algorithm that extends (and slightly improves) the one derived by Lazaric et al. (2012) in the specific case where λ=0. In particular, our analysis sheds some light on the choice of λ with respect to the quality of the chosen linear space and the number of samples, and is consistent with our simulations. This work was presented at the French national JFPDA conference [34].
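For reference, the algorithm under analysis can be sketched in a few lines. This is the standard single-trajectory form of LSTD(λ) (accumulating the matrix A and vector b along the trajectory via an eligibility trace); the ridge term and function signature are our own illustrative choices.

```python
import numpy as np

def lstd_lambda(features, rewards, gamma, lam, reg=1e-6):
    """LSTD(lambda) on a single trajectory.

    features: array of shape (T + 1, d), feature vectors phi(s_0..s_T);
    rewards:  array of shape (T,), reward received on step t.
    Returns theta such that phi(s)^T theta approximates V^pi(s).
    """
    d = features.shape[1]
    A = reg * np.eye(d)       # small ridge term ensures invertibility
    b = np.zeros(d)
    z = np.zeros(d)           # eligibility trace
    for t, r in enumerate(rewards):
        phi, phi_next = features[t], features[t + 1]
        z = gamma * lam * z + phi
        A += np.outer(z, phi - gamma * phi_next)
        b += r * z
    return np.linalg.solve(A, b)
```

The analysis in [40] quantifies how fast the empirical solution above converges to its limit as the trajectory length grows, as a function of λ.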

In the context of infinite-horizon discounted optimal control problems formalized by Markov Decision Processes, we focus on several approximate variants of the Policy Iteration algorithm: Approximate Policy Iteration (API), Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search by Dynamic Programming algorithm to the infinite-horizon case (PSDP), and the recently proposed Non-Stationary Policy Iteration (NSPI). For all algorithms, we describe performance bounds with respect to the per-iteration error ϵ, and make a comparison paying particular attention to the concentrability constants involved, the number of iterations, and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API, but this comes at the cost of a relative increase, exponential in 1/ϵ, of the number of iterations. 2) PSDP enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API, which requires constant memory, the memory needed by CPI and PSDP is proportional to their number of iterations, which may be problematic when the discount factor γ is close to 1 or the approximation error ϵ is close to 0; we show that the NSPI algorithm allows an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis. This work was presented at this year's International Conference on Machine Learning (ICML) [28].
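The memory point above can be made concrete with a minimal sketch of the CPI update. Whereas API switches entirely to the new greedy policy at each iteration, CPI takes a small step of size α toward it, so the current policy is a growing mixture of all past greedy policies (hence memory proportional to the number of iterations). The representation and function names below are our own illustration, not the paper's notation.

```python
import numpy as np

def cpi_update(mixture, pi_greedy, alpha):
    """One conservative policy iteration step on a mixture policy.

    mixture: list of (weight, policy) pairs with weights summing to 1,
    each policy an (S, A) array of per-state action distributions.
    The mixture grows by one component per iteration, which is why
    CPI's memory footprint is proportional to its iteration count.
    """
    mixture = [(w * (1.0 - alpha), p) for w, p in mixture]
    mixture.append((alpha, pi_greedy))
    return mixture

def flatten(mixture):
    """Collapse the mixture into a single stochastic policy (S, A)."""
    return sum(w * p for w, p in mixture)
```

NSPI can then be understood as bounding the number of stored components, trading some of CPI's performance guarantee for constant memory.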

Finally, we consider Local Policy Search, a popular reinforcement learning approach for handling large state spaces. Formally, it searches locally in a parameterized policy space in order to maximize the associated value function averaged over some predefined distribution. The best one can hope for in general from such an approach is a local optimum of this criterion. The first contribution of this work is the following surprising result: if the policy space is convex, any (approximate) local optimum enjoys a global performance guarantee. Unfortunately, the convexity assumption is strong: it is not satisfied by commonly used parameterizations, and designing a parameterization that induces this property seems hard. A natural way to alleviate this issue is to derive an algorithm that solves the local policy search problem using a boosting approach (constrained to the convex hull of the policy space). The resulting algorithm turns out to be a slight generalization of conservative policy iteration; thus, our second contribution is to highlight an original connection between local policy search and approximate dynamic programming. This work was presented at this year's European Conference on Machine Learning (ECML) [27].

Adaptive Management with POMDPs

Participants : Olivier Buffet, Jilles Dibangoye.

Samuel Nicol and Iadine Chadès (CSIRO) are external collaborators.

In the field of conservation biology, adaptive management is about managing a system, e.g., performing actions so as to protect some endangered species, while learning how it behaves. This is a typical reinforcement learning task that could for example be addressed through Bayesian Reinforcement Learning.

During Samuel Nicol's visit, the main problem we studied is how to schedule company inspections so as to deter companies from adopting dangerous behaviors. This was modeled as a particular Stackelberg game, where N companies benefit from acting badly as long as they are not caught by inspections, and where a single government agency has to decide which companies to inspect given a limited budget. The expected result is a stochastic strategy (randomly deciding which companies to inspect, with probabilities that depend on the benefits and losses of both types of players). We are working on exploiting particular features of this computationally complex problem to make it more tractable.
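The deterrence logic can be illustrated with a deliberately simplified toy model (our own assumptions for exposition, not the game actually studied): a company cheats when its expected benefit is positive, so an inspection probability at its indifference threshold suffices to deter it, and a budget-limited agency can allocate probability greedily to the cheapest-to-deter companies first.

```python
def deterrence_probabilities(gains, penalties, budget):
    """Toy randomized-inspection allocation (illustrative only).

    Company i cheats when (1 - p_i) * gain_i - p_i * penalty_i > 0,
    so it is deterred as soon as p_i >= gain_i / (gain_i + penalty_i).
    Subject to sum(p_i) <= budget, allocate greedily to the companies
    with the lowest deterrence thresholds.
    """
    thresholds = sorted(
        (g / (g + c), i) for i, (g, c) in enumerate(zip(gains, penalties)))
    p = [0.0] * len(gains)
    for thr, i in thresholds:
        if thr <= budget:
            p[i] = thr            # just enough probability to deter company i
            budget -= thr
    return p
```

The actual problem is harder than this sketch suggests (companies' payoffs interact and the agency may value deterring some companies more than others), which is where the structure-exploiting work comes in.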

Solving decentralized stochastic control problems as continuous-state MDPs

Participants : Jilles Dibangoye, Olivier Buffet, François Charpillet.

Christopher Amato (MIT) is an external collaborator.

Decentralized partially observable Markov decision processes (DEC-POMDPs) are rich models for cooperative decision-making under uncertainty, but are often intractable to solve optimally (NEXP-complete), even using efficient heuristic search algorithms.

State-of-the-art approaches rely on turning a Dec-POMDP into an equivalent deterministic MDP (whose actions at time t correspond to a vector containing one decision rule, i.e., instantaneous policy, per agent), typically solved using a heuristic search algorithm inspired by A*. In recent work (IJCAI'13), we identified a sufficient statistic of this MDP (an occupancy state, i.e., a probability distribution over possible states and joint histories of the agents) and demonstrated that the value function is piecewise-linear and convex with respect to this statistic. This puts us in the same situation as with POMDPs, making it possible to generalize the value function from one occupancy state to another and to propose much faster algorithms (also using efficient compression methods).
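The deterministic dynamics of this occupancy-state MDP can be sketched as follows: given a joint decision rule, the next occupancy state is obtained by marginalizing over the current state and appending the (action, observation) pair to each history. The data structures and names below are our own illustrative choices, not the implementation from the paper.

```python
from collections import defaultdict

def next_occupancy(eta, decision_rule, T, O):
    """One occupancy-state update for a Dec-POMDP (illustrative sketch).

    eta: dict mapping (state, joint_history) -> probability;
    decision_rule: maps joint_history -> joint action;
    T[s][a][s2]: transition probability s -> s2 under joint action a;
    O[a][s2][z]: probability of joint observation z after (a, s2).
    Note the update is deterministic given the decision rule, which is
    what makes the reformulation a deterministic MDP.
    """
    eta2 = defaultdict(float)
    for (s, h), p in eta.items():
        a = decision_rule[h]
        for s2, pt in T[s][a].items():
            for z, po in O[a][s2].items():
                eta2[(s2, h + ((a, z),))] += p * pt * po
    return dict(eta2)
```

The piecewise-linearity and convexity result then licenses POMDP-style value-function generalization across occupancy states, instead of evaluating each one from scratch.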

This year, we have further progressed on this line of research.

Learning Bad Actions

Participant : Olivier Buffet.

Jörg Hoffmann, former member of MAIA, Michal Krajňanský (Saarland University), and Alan Fern (Oregon State University) are external collaborators.

In classical planning, a key problem is to exploit heuristic knowledge to efficiently guide the search for a sequence of actions leading to a goal state.

In some settings, one may have the opportunity to solve multiple small instances of a problem before tackling larger ones, e.g., handling a logistics problem with small numbers of trucks, depots and items before moving to (much) larger numbers. The small instances may then yield knowledge that can be reused when facing larger instances. Previous work shows that it is difficult to directly learn rules specifying which action to pick in a given situation. Instead, we look for rules telling which actions should not be considered, so as to reduce the search space. This approach raises several questions: What are examples of bad (or non-bad) actions? How can they be obtained? Which learning algorithm should be used?
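One naive way to operationalize the "bad action" idea (a deliberate simplification for exposition, not the algorithm presented below or at ECAI'14) is to label an action as bad when it was never used by any optimal plan across the solved training instances, and to prune such actions during search on larger instances:

```python
def learn_bad_action_rules(examples):
    """Conjecture 'bad' action schemas from solved small instances.

    examples: iterable of (action_schema, used_in_some_optimal_plan)
    pairs collected from training instances. A schema never used by
    any optimal plan, despite being applicable, is conjectured bad.
    """
    seen, used = set(), set()
    for schema, was_used in examples:
        seen.add(schema)
        if was_used:
            used.add(schema)
    return seen - used

def prune(applicable_actions, bad):
    """Drop conjectured-bad actions, keeping at least one candidate."""
    kept = [a for a in applicable_actions if a not in bad]
    return kept or applicable_actions
```

The real difficulty, reflected in the questions above, is that such conjectures can be wrong, so the pruning is only "supposedly" safe and must be evaluated empirically.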

A first algorithm (with variants) has been proposed that learns rules for detecting (supposedly) bad actions. It has been empirically evaluated, providing encouraging results, but also showing that different variants perform best in different settings. This algorithm was presented at ECAI'2014 [24] and participated in the learning track of the 2014 International Planning Competition.