Section: Scientific Foundations
The blossoming use of Multi-Armed Bandit (MAB) algorithms to revisit reinforcement learning (P. Auer and R. Ortner. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning. NIPS'06, pp. 49–56, MIT Press, 2007), tree search (a typical example is UCT: L. Kocsis and C. Szepesvári. Bandit-based Monte-Carlo planning. In J. Fürnkranz et al., eds., ECML'06, pp. 282–293, LNAI 4212, Springer Verlag, 2006), including games (see section 6.2.3), and optimization (P.-A. Coquelin and R. Munos. Bandit Algorithms for Tree Search. Technical report INRIA 6141, 2007) can be explained by two factors. On the one hand, MABs aim at minimizing the regret, i.e. the cumulative loss with respect to the oracle strategy; this elegant criterion is amenable to theoretical bounds; furthermore, it is relevant in an any-time perspective, whereas many theoretical studies mostly focus on asymptotic performance (see also section 3.4). On the other hand, the decision-making process achieved by MABs enforces an Exploitation vs Exploration (EvE) tradeoff; a wide variety of Exploration-related criteria has been considered in the literature, with and (most often) without theoretical justification.
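The EvE tradeoff is concretely enforced by index policies such as the standard UCB1 algorithm (upper confidence bounds added to empirical means). As a purely illustrative sketch, assuming Bernoulli arms and arbitrary parameters:

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Minimal UCB1 sketch: play each arm once, then pick the arm
    maximizing empirical mean + sqrt(2 ln t / n). Returns the
    cumulative (pseudo-)regret against the best arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    regret = 0.0
    best = max(arm_means)
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull each arm once
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
    return regret
```

The logarithmic confidence term is what yields the regret bounds mentioned above: the suboptimal arms are pulled only O(log T) times.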
Several extensions of MAB algorithms and analysis have been identified as theoretical and applicative priorities on our research agenda:
A first extension is required to deal with dynamic environments, relaxing the assumption of iid rewards for each option (bandit arm). Let us consider for instance the EvE tradeoff at the core of evolutionary computation, of game strategies, or of news recommendation (Cédric Hartland, Sylvain Gelly, Nicolas Baskiotis, Olivier Teytaud, and Michèle Sebag. Multi-armed bandit, dynamic environments and meta-bandits. Online Trading of Exploration and Exploitation Workshop, NIPS, 2006); the rewards associated with a given option (respectively, variation operator, game move or news item) are not stationary: they evolve while the search or the game goes on, or as the user's needs and mood change. Some algorithmic advances have been made in Tao (development of MoGo, winning participation in the Pascal Challenge on Online Trading of EvE), extending the standard Upper Confidence Bound algorithms to handle non-stationary environments. Further work is required to extend MAB algorithms to Monte-Carlo-based planning (as an alternative to dynamic programming) and to provide theoretical guarantees on the global solution quality.
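One standard way to relax the iid assumption is to geometrically down-weight past observations, in the spirit of discounted UCB policies for non-stationary bandits. The sketch below is illustrative only; the discount factor gamma and exploration constant c are assumed tuning knobs, not values from the work described above:

```python
import math

class DiscountedUCB:
    """Sketch of a discounted UCB policy for non-stationary bandits:
    gamma < 1 geometrically decays old statistics, so the policy can
    track a reward distribution that drifts or switches over time."""
    def __init__(self, n_arms, gamma=0.98, c=0.5):
        self.gamma, self.c = gamma, c
        self.n = [0.0] * n_arms   # discounted pull counts
        self.s = [0.0] * n_arms   # discounted reward sums

    def select(self):
        for i, n_i in enumerate(self.n):
            if n_i == 0.0:
                return i          # play each arm once first
        total = max(sum(self.n), 2.0)
        return max(range(len(self.n)),
                   key=lambda i: self.s[i] / self.n[i]
                   + self.c * math.sqrt(math.log(total) / self.n[i]))

    def update(self, arm, reward):
        for i in range(len(self.n)):  # decay all statistics
            self.n[i] *= self.gamma
            self.s[i] *= self.gamma
        self.n[arm] += 1.0
        self.s[arm] += reward
```

Because the discounted count of an unplayed arm shrinks, its confidence bonus grows again, forcing periodic re-exploration; this is what lets the policy recover after a change point.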
A second extension regards multi-variate bandits. In quite a few application domains, some side information is available (e.g. the user profile in a news recommendation context) and can be used to handle the EvE tradeoff more efficiently. In the MoGo system, the so-called RAVE (Rapid Action Value Estimate) heuristic provides additional estimates of the move values; a significant improvement of MoGo has been obtained by exploiting this additional, most often strongly biased, side information. Notably, multi-variate bandit algorithms have been acknowledged as a priority research direction in the PASCAL-2 roadmap (the PASCAL Network of Excellence (2003-2008) will be continued in the FP7 framework (2008-2013)). Further study is required to both design more efficient multi-variate bandit algorithms, and provide theoretical guarantees thereof.
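A common formalization of bandits with side information is the linear contextual bandit, where the reward is modelled as a linear function of the context vector. The LinUCB-style sketch below is illustrative of this family (it is not the RAVE heuristic itself): each arm keeps a ridge-regression estimate of the reward plus an optimism bonus from the confidence ellipsoid, with alpha an assumed exploration parameter:

```python
import numpy as np

class LinUCB:
    """Sketch of a disjoint LinUCB policy: per-arm ridge regression on
    the side-information vector x (e.g. a user profile), plus an
    optimism-in-the-face-of-uncertainty bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

The bonus shrinks in directions of the context space where an arm has already been observed, so exploration concentrates on genuinely uncertain (arm, context) pairs.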
Thirdly, the extension of MAB algorithms to the bounded rationality framework, e.g., increasing the number of options and considering a short-term time horizon, is both a theoretical and an applicative challenge. Quite a few application domains involve many options (e.g. circa 400 arms in computer-go, and infinitely many in continuous frameworks); furthermore, in games or planning, the stress is put on short-term performance (as opposed to asymptotic performance). We have developed efficient anytime algorithms, extending Berry et al. (D. A. Berry, R. W. Chen, A. Zame, D. C. Heath, and L. A. Shepp. Bandit problems with infinitely many arms. Ann. Stat. 25(5):2103-2116, 1997) in the so-called easy setting where the reward distribution is favorable (many arms have a reward probability close to 1); we also proposed original heuristics in the so-called difficult setting (the reward probabilities are in [0, ε], with ε < 1). While the empirical efficiency of these algorithms has been found satisficing, their theoretical foundations remain to be established.
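In the easy setting, where many arms have a reward probability close to 1, very simple stay-with-a-winner rules already cope with infinitely many arms. A minimal sketch (the arm distribution below is purely illustrative, not from the work cited above):

```python
import random

def stay_with_winner(arm_sampler, horizon, seed=0):
    """Sketch of a stay-with-a-winner rule for infinitely many
    Bernoulli arms: keep playing the current arm as long as it
    succeeds, and draw a fresh arm after the first failure."""
    rng = random.Random(seed)
    p = arm_sampler(rng)          # mean of the current (freshly drawn) arm
    successes = 0
    for _ in range(horizon):
        if rng.random() < p:
            successes += 1        # success: stay with this arm
        else:
            p = arm_sampler(rng)  # failure: abandon it, draw a new arm
    return successes
```

Good arms enjoy long runs while bad arms are discarded after a single failure, so the time spent on each arm is automatically proportional to its quality; note the rule never has to enumerate the (infinite) arm set.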
A fourth research perspective related to MAB is concerned with multi-objective settings. For instance in autonomous robot control, every option can be assessed along several criteria, such as its value (to which extent the option is instrumental to reaching the robot goal) and its risks (to which extent the option can damage the robot integrity). Although multi-objective optimization can always be cast as a mono-objective optimization problem (e.g. classically considering the weighted sum of the objective functions as the single objective), it is believed that multi-objective bandits correspond to a relevant and daring extension of MAB algorithms. On the one hand, this extension aims at finding optimal, e.g. controlled risk-taking, decision strategies; on the other hand, it requires extending the regret definition (e.g. cumulative distance to the Pareto front).
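The weighted-sum reduction mentioned above can be sketched as follows: each pull returns a reward vector (here, a value component and a safety component), which a fixed weight vector collapses to a scalar before a standard UCB update. The arms and weights are illustrative assumptions:

```python
import math
import random

def scalarized_ucb(arms, weights, horizon, seed=0):
    """Sketch of a scalarized multi-objective bandit: each arm i yields
    an independent Bernoulli draw per objective (success probabilities
    arms[i]); the weighted sum of the reward vector feeds a plain UCB1
    index. Returns the pull counts per arm."""
    rng = random.Random(seed)
    k = len(arms)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1
        else:
            i = max(range(k), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        vec = [1.0 if rng.random() < p else 0.0 for p in arms[i]]
        counts[i] += 1
        sums[i] += sum(w * r for w, r in zip(weights, vec))  # scalarize
    return counts
```

The weight vector encodes a fixed risk attitude; this is exactly the limitation the paragraph above points at, since a single scalarization explores only one point of the Pareto front, whereas a genuinely multi-objective bandit would need a vector-valued regret notion.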
These research themes are directly relevant to the Microsoft-TAO project, the SYMBRION IP, and Autonomic Computing.
Typically, the online learning of hyperparameters tackled by the Microsoft-TAO project can be formalized as a MAB problem; bounded rationality is similarly relevant, at least in the calibration stage of the current application. The multi-objective aspect is also relevant as the algorithmic performance can usually be assessed along quite a few independent criteria (e.g. time-to-solution, memory usage, solution accuracy).
Independently, optimal decision making under bounded resource constraints is relevant to the SYMBRION IP. Likewise, the mid-term goal of Autonomic Computing is to deliver satisficing job schedulers. In both cases, as experiments are done in situ, the learning algorithm must find a way to preserve the system integrity and limit the risks incurred by any system unit.