Section: New Results
Optimal Decision Making under Uncertainty
Participants: Olivier Teytaud [correspondent], Jean-Joseph Christophe, Jérémie Decock, Nicolas Galichet, Marc Schoenauer, Michèle Sebag, Weijia Wang.
The UCT SIG works on sequential decision-making problems, where a decision has to be made at each time step along a finite time horizon, and the underlying problem involves uncertainties, in either an adversarial or a stochastic setting.
After several years of success in the domain of Go, the most prominent application domain here is now energy management, at various time scales, and more generally planning. Furthermore, the work in this SIG has also led to advances in continuous optimization at large, which somewhat overlap with the work in the OPT SIG (see 6.3).
The main advances done this year include:
 Bandit-based Algorithms

Active learning for the identification of biological dynamical systems has been tackled using Multi-Armed Bandit algorithms [35]. Weijia Wang's PhD thesis [5] summarizes the work done in TAO on Multi-objective Reinforcement Learning with MCTS algorithms. Differential Evolution was applied as an alternative to solve non-stationary bandit problems [45].
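To make the bandit setting concrete, here is a minimal UCB1 sketch (UCB1 is one classical multi-armed bandit strategy; the works cited above use more elaborate variants tailored to their settings). The arm definitions and budget below are purely illustrative.

```python
import math
import random

def ucb1(reward_fns, budget, seed=0):
    """Minimal UCB1: after pulling each arm once, play the arm that
    maximizes empirical mean + sqrt(2 log t / n_i) exploration bonus."""
    rng = random.Random(seed)
    k = len(reward_fns)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        if t < k:
            arm = t                       # initialization: pull each arm once
        else:
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = reward_fns[arm](rng)
        counts[arm] += 1
        sums[arm] += r
    return counts

# Two hypothetical Bernoulli arms; the better arm ends up pulled more often.
arms = [lambda rng: float(rng.random() < 0.4),
        lambda rng: float(rng.random() < 0.7)]
pulls = ucb1(arms, budget=2000)
```

The exploration bonus shrinks as an arm accumulates pulls, so the budget concentrates on the empirically best arm while still occasionally revisiting the others.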
 Continuous optimization: parallelism, real-world, high-dimension and cutting-plane methods

Our work in continuous optimization extends testbeds as follows: (i) including higher dimensions (many testbeds in evolutionary algorithms consider dimension $\le 40$ or $\le 100$); (ii) taking into account computation time and not only the number of function evaluations (this makes a big difference in high dimension); (iii) including real-world objective functions; (iv) including parallelism, in particular parallel convergence rates for differential evolution and particle swarm optimization [21]. We have a parallel version of cutting-plane methods, which uses more than black-box evaluations of the objective function; we keep in mind that some of our black-box methods, on the other hand, do not need convexity or the existence of a gradient.
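Point (ii) above can be illustrated with a toy sketch: when a population of candidates is evaluated in parallel, the evaluation count per generation grows with the population size while the wall-clock time stays close to that of a single evaluation. The objective and population below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def sphere(x):
    # toy objective; real-world objectives may take seconds per call,
    # which is what makes wall-clock accounting differ from simply
    # counting function evaluations
    return sum(xi * xi for xi in x)

def evaluate_population(pop, workers=4):
    """Evaluate a whole population in parallel: the evaluation count
    grows with the population size, but the wall-clock cost per
    generation stays close to that of a single (slow) evaluation."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(sphere, pop))

pop = [[float(i), float(-i)] for i in range(8)]
fitness = evaluate_population(pop)
```

A benchmark that only counts evaluations would charge this generation eight units, while a wall-clock benchmark charges roughly two (eight evaluations over four workers).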
 Noisy optimization

We have been working on noisy optimization in discrete and continuous domains. In the discrete case, we have shown the impact of heavy tails, and we have shown that resampling can solve some published open problems in an anytime manner. In the continuous case, we have shown [16] that a classical evolutionary principle (namely a step-size proportional to the distance to the optimum) implies that the optimal rates cannot be reached; more precisely, with $n$ fitness evaluations, the simple regret is at best $O(1/\sqrt{n})$ in the simple case of additive noise, whereas some published algorithms reach $O(1/n)$. One of the most directly applicable of our works is bias correction when the objective function has the form $f(x)={\mathbb{E}}_{\omega}\,f(x,\omega)$ and is approximated by $\hat{f}(x)=\frac{1}{N}\sum_{i=1}^{N}f(x,\omega_i)$ for a given finite sample $\omega_1,\dots,\omega_N$. We have also worked on portfolios of noisy optimizers [20], [34].
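The need for bias correction can be illustrated on a toy example: selecting the candidate with the best resampled average systematically underestimates the true objective value (a "winner's curse"). The objective, noise model, and sample sizes below are illustrative, not those of the cited works.

```python
import random

def noisy_f(x, rng):
    # toy noisy objective with additive noise: E[noisy_f(x)] = x * x
    return x * x + rng.gauss(0.0, 1.0)

def resampled_value(x, n, rng):
    """Average n resamplings at x; the estimator's variance is 1/n."""
    return sum(noisy_f(x, rng) for _ in range(n)) / n

def best_of(candidates, n, rng):
    """Report the candidate with the lowest resampled average. The
    reported value is biased low ('winner's curse'): the minimum of
    several unbiased estimates underestimates the true minimum."""
    values = [resampled_value(x, n, rng) for x in candidates]
    i = min(range(len(candidates)), key=values.__getitem__)
    return candidates[i], values[i]

rng = random.Random(0)
xs = [0.0] * 20   # 20 copies of the true optimum, where f(x) = 0
# Averaged over repetitions, the reported best value is well below 0,
# although the true value at every candidate is exactly 0.
mean_best = sum(best_of(xs, 5, rng)[1] for _ in range(200)) / 200
```

Increasing the number of resamplings per candidate shrinks this bias, which is what resampling schemes and explicit bias corrections trade off against the evaluation budget.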
 Discrete-time control with constrained action spaces.

While Direct Policy Search is a reliable approach for discrete-time control, it is not easily applicable in the case of a constrained high-dimensional action space. In the past, we have proposed Direct Value Search (DVS) for such cases [54]. The method is satisfactory, and we have additional mathematical results; in particular, we prove positive results for non-Markovian, non-convex problems, and we prove polynomial-time decision making together with exact asymptotic consistency for nonlinear transitions [24]. Related work [60] also proposes to directly learn the value function, in a reinforcement learning context, using trajectories known to be bad.
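For illustration, the decision-making layer of a value-based controller can be sketched as a one-step lookahead over the action space; the actual Direct Value Search method of [54] is more elaborate (in particular, the value function is itself the object of the search, and the per-step decision is a structured optimization over the constrained action space). The toy transition and value function below are hypothetical.

```python
def greedy_decision(state, actions, transition, value):
    """One-step lookahead: pick the feasible action maximizing the
    immediate reward plus the (learned) value of the next state.
    This is only the decision-making layer; in Direct Value Search
    the value function itself is optimized offline."""
    def score(a):
        reward, next_state = transition(state, a)
        return reward + value(next_state)
    return max(actions, key=score)

# Hypothetical toy problem: a 1-D stock; action a moves the stock by a
# at cost |a|, and the (assumed learned) value function rewards being
# close to a target level of 10.
def transition(s, a):
    return (-abs(a), s + a)

value = lambda s: -(s - 10.0) ** 2

a = greedy_decision(0.0, [-1.0, 0.0, 1.0, 2.0], transition, value)
```

With this toy value function the lookahead picks the largest feasible step towards the target, since the quadratic gain outweighs the linear cost.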
 Games.

While still lightly contributing to the game of Go with our Taiwanese partners [8], we obtained significant improvements in randomized artificial intelligence algorithms by decomposing the variance of the result into (i) the random seed and (ii) the other random contributions, such as the random seed of the opponent and/or the random part of the game. By optimizing our probability distribution on random seeds, we get significant improvements in, e.g., Phantom Go. This is basically a simple tool for learning opening books [44].
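A simplified sketch of the idea, assuming a hypothetical game in which some seeds are intrinsically stronger: estimate the win rate of each candidate seed against randomized opponents, then concentrate the playing distribution on the best ones. The game, seeds, and parameters below are illustrative; [44] optimizes the distribution more carefully.

```python
import random

def seed_portfolio(play, seeds, games_per_seed, top_k, rng):
    """Estimate the win rate of each candidate random seed, then keep
    only the top_k seeds (a simple sparse distribution over seeds)."""
    rates = {}
    for s in seeds:
        wins = sum(play(s, rng) for _ in range(games_per_seed))
        rates[s] = wins / games_per_seed
    best = sorted(seeds, key=rates.get, reverse=True)[:top_k]
    return best, rates

# Hypothetical game: each seed has an intrinsic win probability
# against a randomized opponent.
strength = {0: 0.3, 1: 0.5, 2: 0.8, 3: 0.4}
def play(seed, rng):
    return int(rng.random() < strength[seed])

best, rates = seed_portfolio(play, [0, 1, 2, 3], 500, 1, random.Random(1))
```

The seed fixes the program's own randomized choices (e.g. its opening moves), so favoring strong seeds amounts to implicitly learning an opening book.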
 Adversarial bandits.

High-dimensional adversarial bandits suffer from two main drawbacks: (i) computation time and (ii) the highly mixed nature of the obtained solutions. We developed methods which focus on sparse solutions. Provably consistent, these methods are faster when the Nash equilibrium is sparse, and provide highly sparse solutions [17].
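A rough sketch of the idea on a toy matrix game: run exponential weights against a best-responding adversary, then truncate negligible probabilities to obtain a sparse approximate Nash strategy. The truncation below is only illustrative; the methods of [17] use a provably consistent sparsification.

```python
import math

def sparse_exp_weights(payoff, iters, eta, threshold):
    """Exponential weights for the row player of a zero-sum matrix
    game against a best-responding column player; the time-averaged
    strategy approximates a Nash equilibrium, and small probabilities
    are then truncated to zero to obtain a sparse strategy."""
    k = len(payoff)
    w = [0.0] * k                      # log-weights of the rows
    cum = [0.0] * k                    # cumulative mixed strategy
    for _ in range(iters):
        m = max(w)
        p = [math.exp(x - m) for x in w]
        z = sum(p)
        p = [x / z for x in p]
        # the adversary best-responds to the current mixed strategy
        col = min(range(len(payoff[0])),
                  key=lambda j: sum(p[i] * payoff[i][j] for i in range(k)))
        for i in range(k):
            w[i] += eta * payoff[i][col]
            cum[i] += p[i]
    avg = [c / iters for c in cum]
    # sparsification (illustrative): drop rows with negligible mass
    s = [x if x > threshold else 0.0 for x in avg]
    z = sum(s)
    return [x / z for x in s]

# Matching pennies padded with a dominated third row: the sparse
# strategy should put no mass on the dominated action.
payoff = [[1, -1], [-1, 1], [-2, -2]]
p = sparse_exp_weights(payoff, iters=2000, eta=0.1, threshold=0.05)
```

When the true equilibrium is sparse, as here, the weight on dominated actions decays exponentially, so the truncation removes them without distorting the equilibrium play.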