Section: New Results
Optimal Decision Making
This SIG is devoted to all aspects of Artificial Intelligence related to sequential settings, specifically sequential decision making under uncertainty and sequential learning. Worldwide successes in computer-Go have been made visible through scientific and popular-science papers (section 6.5.1). Other applications, unrelated to the game of Go, have been realized and published (section 6.5.2).
Sequential Decision Under Uncertainty Applied To Games
The game of Go, a 2000-year-old Asian game of major cultural importance in China, Korea, Taiwan, and Japan, proves much more difficult for computers than chess (note that humans can no longer win against computers at chess, unless the machine plays with a handicap).
Among recent progress in this game, we designed a set of openings, computed on a grid, using a Monte-Carlo Tree Search strategy. Several learning modes (offline, online and transient learning) were combined. The role of exploration has been carefully analyzed, showing that in many cases this role should remain very moderate (its weight in UCT-like implementations should be close to 0), except perhaps when the exploration term involves patterns or other information beyond the number of simulations. Offline learning is key at the beginning of the learning curve (first visit to a node), while online learning governs the asymptotic regime of the process; transient learning manages the transition between the offline and online regimes. This multi-level learning approach is an original contribution to the field of computer-Go.
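The UCT-style selection rule discussed above can be sketched as follows; the function name and the exploration weight `c` are illustrative, not MoGo's actual code:

```python
import math

def uct_score(child_wins, child_visits, parent_visits, c=0.0):
    """Upper-confidence score of a child node in a UCT-like tree search.

    Strong computer-Go programs keep the exploration weight c close to 0,
    relying instead on the empirical mean and on learned priors.
    """
    if child_visits == 0:
        return float("inf")  # unvisited children are tried first
    exploit = child_wins / child_visits                        # empirical mean reward
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore
```

With `c = 0` the rule degenerates to greedy selection over empirical means, which is precisely where offline priors matter most for rarely visited nodes.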
As mentioned in the highlights, MoGo and its branch MoGoTW (joint work with the National University of Tainan) remain the world-leading programs in computer-Go (first and only win as Black in 9x9 Go against a top professional player; first and only win with a 7-stone handicap (H7) against a top professional player; first and only win with H6 against a professional player), and our results have been widely acknowledged (Communications of the ACM, vol. 51, no. 10, Oct. 2008, page 13, reported the first ever win against a professional player in 19x19 Go, as did several newspapers; see http://www.lri.fr/~teytaud/mogo.html). The generality of the MoGo principles has been mentioned earlier; very similar techniques can indeed be applied to other games, in particular Havannah, a game designed to be challenging for computers. Furthermore, in collaboration with the SPIRAL team at Carnegie Mellon University (Pittsburgh, USA), the same principles have been successfully applied to the on-line choice of the components of the SPIRAL library. It was shown that frugality in bandits is crucial for Monte-Carlo Tree Search applications, which explains why exploration constants are null in most strong computer-Go programs.
Other Bandit-based Applications
The MoGo ideas were applied in three more fields: active learning, “optimal” optimization and feature selection. Philippe Rolet's PhD (Digiteo grant, in collaboration with CEA DM2S) formalizes active learning as a reinforcement learning problem, and proposes a one-player game approach to approximate the (intractable) optimal strategy thereof. This approach features an any-time behaviour, asymptotically delivering the optimal instance-selection strategy with respect to the expected generalization error conditional on the number of queries allowed. The convergence speed has been experimentally shown to be satisfactory; see below.
Along similar lines, the feasibility of optimal optimization has been investigated using UCT algorithms; the goal, likewise, is to find the optimum of a function with a minimum number of queries, minimizing the expected loss. While this algorithm is computationally very expensive, it features an optimal query strategy for a given prior on the function space.
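A drastically simplified, flat-bandit version of this query-allocation idea can be sketched as follows; the real algorithm is tree-based and exploits a prior on the function space, so the function below is only a didactic stand-in:

```python
import math

def ucb_optimize(f, candidates, budget, c=1.0):
    """Spend a fixed query budget over candidate points using UCB1,
    then return the candidate with the best empirical value.

    This flat-bandit sketch maximizes f; it ignores the tree structure
    and the function-space prior of the actual optimal-optimization work.
    """
    counts = [0] * len(candidates)
    sums = [0.0] * len(candidates)
    for t in range(1, budget + 1):
        scores = []
        for i in range(len(candidates)):
            if counts[i] == 0:
                scores.append(float("inf"))  # query each point at least once
            else:
                mean = sums[i] / counts[i]
                scores.append(mean + c * math.sqrt(math.log(t) / counts[i]))
        i = max(range(len(candidates)), key=scores.__getitem__)
        sums[i] += f(candidates[i])
        counts[i] += 1
    best = max(range(len(candidates)), key=lambda i: sums[i] / max(counts[i], 1))
    return candidates[best]
```

The design point carried over from the Go work is that exploration is budgeted, not free: each query is costly, so the allocation rule must trade off refining the current best against probing alternatives.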
Lastly, Romaric Gaudel's PhD (ENS grant) formalizes feature selection as a reinforcement learning problem, aimed at minimizing the expectation of the generalization error; the optimal (intractable) strategy is approximated again using an extended version of UCT, dealing with a finite unknown number of options  .
Interestingly, all the above approaches combine the same ingredients on top of the MoGo expertise:
Each problem is formalized as a Partially Observable Markov Decision Process (POMDP), where the optimal strategy is the solution of an (intractable) reinforcement learning problem;
An any-time approximation of this strategy is obtained using UCT, using diverse extensions in order to deal with continuous spaces and a finite unknown horizon;
The Monte-Carlo search within UCT relies on billiard algorithms, generating conditional distributions depending on the priors, with good scalability with respect to the dimension of the search space.
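As a rough illustration of the last ingredient, for small problems the billiard step can be replaced by plain rejection sampling from the prior conditioned on the observations; the helper names below are hypothetical:

```python
import random

def sample_consistent(prior_sample, is_consistent, observations, max_tries=10000):
    """Draw hypotheses from the prior until one agrees with all observations.

    Billiard (hit-and-run) algorithms perform this conditional sampling far
    more efficiently in high dimension; rejection sampling is only a
    didactic stand-in that scales poorly as observations accumulate.
    """
    for _ in range(max_tries):
        h = prior_sample()
        if all(is_consistent(h, obs) for obs in observations):
            return h
    raise RuntimeError("no consistent hypothesis found within the budget")
```

For instance, sampling a threshold uniformly on [0, 1] conditioned on the observation "the threshold exceeds 0.5" amounts to calling this helper with a uniform `prior_sample` and a comparison predicate.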