Team TAO

Section: Scientific Foundations

Generalized Bandits

The blossoming use of Multi-Armed Bandit (MAB) algorithms to revisit reinforcement learning (P. Auer and R. Ortner. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning. NIPS'06, pp 49–56, MIT Press, 2007), tree search (a typical example is UCT: L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In J. Fürnkranz et al., eds., ECML'06, pp 282–293, LNAI 4212, Springer Verlag, 2006), including games (see section 6.2.3), and optimization (P.-A. Coquelin and R. Munos. Bandit Algorithms for Tree Search. Technical report INRIA 6141, 2007) can be explained by two factors. On the one hand, MABs aim at minimizing the regret, i.e. the cumulative loss with respect to the oracle strategy; this elegant criterion is amenable to theoretical bounds; furthermore, it is relevant in an any-time perspective, whereas many theoretical studies mostly focus on asymptotic performance (see also section 3.4). On the other hand, the decision-making process achieved by MABs enforces an Exploitation vs Exploration (EvE) tradeoff; a wide variety of exploration-related criteria has been considered in the literature, with and (most often) without theoretical justification.
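The regret criterion and the EvE tradeoff can be illustrated with UCB1, one of the standard MAB algorithms underlying UCT: each arm's empirical mean is augmented with an exploration bonus that shrinks as the arm is pulled more often. The sketch below is illustrative only; the reward function, horizon, and Bernoulli arms are assumptions, not taken from the projects discussed here.

```python
import math
import random

def ucb1(reward_fn, n_arms, horizon, seed=0):
    """Minimal UCB1 sketch: pull each arm once, then repeatedly pick
    the arm maximizing (empirical mean + exploration bonus)."""
    rng = random.Random(seed)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # cumulative reward per arm
    for t in range(horizon):
        if t < n_arms:
            arm = t  # initialization: pull each arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t + 1) / counts[a]),
            )
        r = reward_fn(arm, rng)
        counts[arm] += 1
        sums[arm] += r
    return counts, sums

# Two hypothetical Bernoulli arms with means 0.2 and 0.8; over a long
# horizon the better arm should dominate the pull counts, which keeps
# the regret (loss vs. always playing the best arm) logarithmic.
means = [0.2, 0.8]
counts, sums = ucb1(
    lambda a, rng: 1.0 if rng.random() < means[a] else 0.0,
    n_arms=2, horizon=2000,
)
```

The exploration bonus sqrt(2 log t / n_a) is exactly what enforces the EvE tradeoff: rarely pulled arms keep a large bonus and are eventually revisited, while well-sampled arms are judged mostly on their empirical mean.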

Several extensions of MAB algorithms and of their analysis have been identified as theoretical and applicative priorities on our research agenda.

These research themes are directly relevant to the Microsoft-TAO project, the SYMBRION IP, and Autonomic Computing.

Typically, the online learning of hyperparameters tackled by the Microsoft-TAO project can be formalized as a MAB problem; bounded rationality is similarly relevant, at least in the calibration stage of the current application. The multi-objective aspect is also relevant, as algorithmic performance can usually be assessed along several independent criteria (e.g. time-to-solution, memory usage, solution accuracy).
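One simple way to feed such multi-criteria performance back to a bandit learner is weighted scalarization: map each criterion to [0, 1] and combine them into a single reward. The sketch below is a hypothetical illustration; the weights, budgets, and criterion names are assumptions, not values used in the Microsoft-TAO project.

```python
def scalarized_reward(time_to_solution, memory_mb, accuracy,
                      weights=(0.4, 0.2, 0.4)):
    """Combine three performance criteria into one bandit reward.

    All bounds and weights below are illustrative assumptions:
    a 60-second time budget, a 1 GiB memory budget, and accuracy
    already in [0, 1].
    """
    time_score = max(0.0, 1.0 - time_to_solution / 60.0)
    mem_score = max(0.0, 1.0 - memory_mb / 1024.0)
    w_t, w_m, w_a = weights
    return w_t * time_score + w_m * mem_score + w_a * accuracy

# A run taking 12 s and 256 MB with accuracy 0.9 scores
# 0.4*0.8 + 0.2*0.75 + 0.4*0.9 = 0.83.
r = scalarized_reward(time_to_solution=12.0, memory_mb=256.0, accuracy=0.9)
```

A fixed scalarization is of course only the simplest treatment of the multi-objective aspect; it collapses the independent criteria into one axis, whereas genuinely multi-objective bandit formulations would keep them separate.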

Independently, optimal decision making under bounded resource constraints is relevant to the SYMBRION IP. Likewise, the mid-term goal of Autonomic Computing is to deliver satisficing job schedulers. In both cases, as experiments are done in situ, the learning algorithm must find a way to preserve the system integrity and limit the risks incurred by any system unit.

