## Section: New Results

### Optimal Decision Making

Participants: Olivier Teytaud, David Auger, Michèle Sebag, Cyril Furtlehner, Jean-Baptiste Hoock, Nataliya Sokolovska, Fabien Teytaud, Hassen Doghmen, Jean-Joseph Christophe, Jérémie Decock.

Monte-Carlo Tree Search (MCTS) and Upper Confidence Trees (UCT) are among the team's main research areas. In particular, we ultra-weakly solved 7x7 Go, winning 20 games out of 20 against professional players thanks to a Meta-Monte-Carlo-Tree Search [24]. The wins were obtained with komi 9.5 as white and komi 8.5 as black, suggesting that the ideal komi in 7x7 is 9. We also applied this algorithm to the recent “NoGo” framework, designed to challenge MCTS with a game that looks like Go but has very different goals; our paper [26] was the first to apply MCTS to NoGo, and all strong NoGo programs now use the MCTS approach. We extended RAVE (Rapid Action Value Estimates) to continuous settings [29]. In his PhD [2], Fabien Teytaud proposed several generic improvements of MCTS, including the use of (fast) decisive and anti-decisive moves for games, and applied them to the game of Havannah. An industrial application (to energy management) is proposed in [71]. An MCTS version for partially observable problems with bounded horizon was proposed in [86]. That version handles the two-player case, but only with simulations starting at the root; a version for the one-player case, starting from an arbitrary state (and therefore much more efficient for large horizons), is proposed in [30]. This work is extended with belief state estimation by constraint satisfaction problems in [62]. Other developments and research around MCTS/UCT are described in the MoGo module.
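
For readers unfamiliar with UCT, the following minimal sketch shows the standard UCB1 rule that UCT applies at each tree node to choose which child to explore next. It is an illustration only and not the MoGo implementation; the function names `ucb1_score` and `select_child`, the `(total_reward, visits)` representation of a child, and the exploration constant `c` are assumptions made for this example.

```python
import math


def ucb1_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 value of one child: empirical mean plus an exploration bonus."""
    if visits == 0:
        # Unvisited children are tried first.
        return math.inf
    mean = total_reward / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, c=math.sqrt(2)):
    """Return the index of the child with the highest UCB1 score.

    `children` is a list of (total_reward, visits) pairs (hypothetical layout).
    Ties are broken in favour of the lowest index.
    """
    parent_visits = sum(visits for _, visits in children)
    scores = [ucb1_score(r, v, parent_visits, c) for r, v in children]
    return max(range(len(children)), key=scores.__getitem__)
```

With `c = 0` the rule becomes purely greedy on the empirical mean; larger values of `c` favour rarely visited children.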

A related important algorithm is Nested Monte-Carlo; we obtained state-of-the-art results for some traveling salesman variants with a very simple algorithm [56].

Fundamental analysis of partially observable games: we proved in [5] that partially observable games are undecidable (a result also presented at the BIRS 2010 workshop and the Bielefeld seminar on Search Methodologies), even in the case of finite state spaces and deterministic transitions. This unexpected result seems, a priori, to contradict known decidability results; it emphasizes the subtle difference between the classical decision problem (the existence of a strategy that wins with certainty, whatever the opponent's strategy), which is used in most analyses, and the choice of the move with optimal winning probability. We pointed out that the relevant decision problem is, without doubt, the latter; that the other decision problem has been used only because it is equivalent to choosing optimal play in the case of fully observable games; and, most importantly, that partially observable games are in fact undecidable, even in the finite deterministic case. On the other hand, in restricted settings, we have shown by some simple lemmas lower and upper bounds on the value of some partially observable games [63]. We extended Monte-Carlo Tree Search to the case of short-term partial information in [61]; this was successfully applied to the Urban Rival game, a widely played internet card game (now 17 million registered users) from a French company.

Tuning of strategies: tuning strategies is a noisy optimization problem in which the convenient assumption of a “variance of noise decreasing to zero around the optimum” usually does not hold. We have shown that, in such a setting, local bandit-style algorithms are slower than surrogate models; this is detailed in the continuous optimization part.

We organized various computer-Go events, to which we are often invited thanks to the fame of our program MoGo; reports can be found in [95].

We developed the “double progressive widening” trick, which aims at keeping an algorithm consistent when moving from the finite case to the continuous stochastic case; we obtained good results on Q-Learning in [60] (with no mathematical proof) and on MCTS in [28] (mathematical proof to be submitted soon).
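
As a rough illustration of the idea behind progressive widening, the sketch below caps the number of children a node may have by a slowly growing function of its visit count. In the “double” variant, the same test is applied twice: to the actions considered at a decision node and to the random outcomes sampled at a chance node. The power-law form, the constants `c` and `alpha`, and the function names are assumptions chosen for illustration; they are not taken from [60] or [28].

```python
import math


def widening_limit(visits, c=1.0, alpha=0.5):
    """Maximum number of children allowed after `visits` visits (at least 1)."""
    return max(1, math.ceil(c * visits ** alpha))


def should_add_new(num_existing, visits, c=1.0, alpha=0.5):
    """True if a new child may be sampled; otherwise reuse an existing one."""
    return num_existing < widening_limit(visits, c, alpha)


def double_widening_decision(n_actions, state_visits, n_outcomes, action_visits,
                             alpha_action=0.5, alpha_outcome=0.25):
    """Apply the widening test at both levels (hypothetical helper).

    Returns a pair (add_action, add_outcome): whether to sample a new action
    at the decision node, and whether to sample a new random outcome for the
    chosen action at the chance node.
    """
    add_action = should_add_new(n_actions, state_visits, alpha=alpha_action)
    add_outcome = should_add_new(n_outcomes, action_visits, alpha=alpha_outcome)
    return add_action, add_outcome
```

Because the limit grows sublinearly, every child that is kept continues to receive visits, which is what allows value estimates to remain meaningful in a continuous domain.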

We have also worked on Nash equilibria of matrix games, for which we proposed an algorithm that finds Nash equilibria faster when the equilibrium is sparse [48], [47]; a mathematical proof is ready and will be submitted soon.
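
To make the problem setting concrete, the sketch below approximates a Nash equilibrium of a zero-sum matrix game with classical fictitious play. This is a textbook baseline and not the sparse-equilibrium algorithm of [48], [47]; the function name `fictitious_play`, the payoff convention (the row player maximizes, the column player minimizes) and the iteration count are assumptions made for this example.

```python
def fictitious_play(payoff, iterations=10000):
    """Approximate a Nash equilibrium of a zero-sum matrix game.

    `payoff[r][c]` is the gain of the row player (and loss of the column
    player). Each player repeatedly best-responds to the empirical mix of
    the opponent's past actions; the empirical frequencies are returned.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    # Cumulative payoff each action would have obtained against history.
    row_gain = [0.0] * n_rows
    col_loss = [0.0] * n_cols
    i, j = 0, 0  # arbitrary initial actions
    for _ in range(iterations):
        row_counts[i] += 1
        col_counts[j] += 1
        for r in range(n_rows):
            row_gain[r] += payoff[r][j]
        for c in range(n_cols):
            col_loss[c] += payoff[i][c]
        # Best responses; ties go to the lowest index.
        i = max(range(n_rows), key=row_gain.__getitem__)
        j = min(range(n_cols), key=col_loss.__getitem__)
    x = [k / iterations for k in row_counts]
    y = [k / iterations for k in col_counts]
    return x, y
```

For example, on matching pennies (`[[1, -1], [-1, 1]]`) both returned strategies approach the uniform mix. When the equilibrium is sparse, most entries of the returned vectors stay close to zero, which is the structure a dedicated algorithm can exploit.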

Work is in progress on applying the previous tools to active learning; active learning has also been investigated through conditional random fields in [59].

Another related work, motivated by autonomous robotics, combines the exploration of the search space through UCT with an explicit model of the safe regions explored so far, called *Deja-Vu*. The Deja-Vu is used to constrain the exploration, mostly in the random phase, and is updated from the current explorations [67].

The Ilab “Metis” has just started; it is an Ilab between Tao, the Inria-Saclay team MaxPlus, and the SME Artelys (http://www.artelys.com) for joint work on numerical libraries in Energy Management.