## Section: New Results

### Machine Learning for Stochastic Optimisation

Participants: Anne Auger, Sylvain Gelly, Nikolaus Hansen, Mohamed Jebalia, Marc Schoenauer, Michèle Sebag, Olivier Teytaud.

This research direction investigates how the theoretical and algorithmic body of knowledge developed in Machine Learning can advance the fundamental study of stochastic optimisation, extend its scope and support more effective algorithms. Three types of contributions have been made: the first is related to the theoretical study of Estimation of Distribution Algorithms; the second considers surrogate optimisation, extending Evolutionary Computation (EC) to deal with computationally expensive objective functions; the third is related to Optimal Decision Making, revisiting dynamic programming and investigating multi-armed bandit algorithms.

#### Fundamental studies of evolutionary algorithms.

Estimation of Distribution Algorithms (EDAs) evolve a probability distribution on the search space by repeating a series of sampling/selection/learning steps. Using Statistical Learning theory, EDAs have been studied in the context of expensive optimisation problems allowing only a small number of iterates [Oops!], and some optimal (w.r.t. robustness) comparison-based EDAs have been proposed [Oops!]. Note that this research is closely related to that on surrogate models described in the next section, 6.2.2.
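
The sampling/selection/learning loop can be illustrated by a minimal sketch. It assumes an axis-parallel Gaussian distribution, truncation selection and a toy `sphere` objective; these choices, as well as all names and parameter values, are illustrative assumptions and not the algorithms analysed in the cited work.

```python
import random
import statistics


def sphere(x):
    """Toy objective, assumed for illustration: squared distance to the origin."""
    return sum(v * v for v in x)


def gaussian_eda(objective, dim=3, pop_size=50, n_selected=15,
                 n_iterations=40, init_mean=3.0, init_std=3.0, seed=0):
    """Minimise `objective` with an axis-parallel Gaussian EDA (hypothetical settings)."""
    rng = random.Random(seed)
    mean = [init_mean] * dim
    std = [init_std] * dim
    for _ in range(n_iterations):
        # Sampling: draw a population from the current distribution.
        population = [[rng.gauss(mean[i], std[i]) for i in range(dim)]
                      for _ in range(pop_size)]
        # Selection: keep the best points. Only the ranking is used,
        # which makes the scheme comparison-based.
        population.sort(key=objective)
        selected = population[:n_selected]
        # Learning: re-estimate the distribution from the selected points.
        for i in range(dim):
            column = [x[i] for x in selected]
            mean[i] = statistics.fmean(column)
            std[i] = max(statistics.pstdev(column), 1e-12)
    return mean


best = gaussian_eda(sphere)
print(sphere(best))  # should be far below the starting value sphere([3, 3, 3]) = 27
```

Because selection only sorts the population, the run is unchanged by any strictly increasing transformation of the objective, which is the sense in which such an algorithm is comparison-based.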

Genetic Programming (GP) extends the Evolutionary Computation paradigm to tree-structured search spaces, essentially the space of programs. In this field, where theory still lags far behind practice, several classical results of Statistical Learning theory have made it possible to delineate the applicability of this technique, and to derive sufficient conditions on the penalty term used in practice to limit the uncontrolled growth of the solution (aka *bloat*) (S. Gelly, O. Teytaud, N. Bredèche, and M. Schoenauer. Universal consistency and bloat in GP. *Revue d'Intelligence Artificielle*, 20(6):805–827, 2006). More recently, necessary conditions have been established, ruling out some heuristics as inconsistent (that is, offering no guarantee of asymptotic convergence to the optimal solution) [Oops!].

More generally, several techniques borrowed from Machine Learning and Complexity Theory have been applied to theoretical investigations of Evolutionary Algorithms. This includes results on the consistency of halting criteria and sufficient conditions for convergence in non-convex settings [Oops!]; the non-validity of the No-Free-Lunch Theorem in continuous optimisation [Oops!]; some limitations of multi-objective optimisation without feedback from the user [Oops!]; an analysis of the parametrisation of the computational effort in stochastic optimisation [Oops!], [Oops!]; and studies of different ways to use quasi-random points in Evolutionary Algorithms [Oops!].

#### Approximations of the fitness function.

Evolutionary Algorithms are known to be computationally expensive. They are therefore a natural target for what mechanical engineers have called Response Surface Methods, revisited in the last few years as Surrogate Model methods. The idea is to build an approximation of the objective function, and to run the optimisation algorithm (whatever the algorithm) on the approximation rather than on the original function. A first crucial issue is the choice of the approximation model. A second important issue, since the approximation has to be updated as the search proceeds, is how often this update should be performed.
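
A minimal sketch of this idea, under stated assumptions: the surrogate is a quadratic interpolating the three archived points nearest to the current best, the inner optimiser is a plain random search, and the surrogate is refitted after every true evaluation. The toy objective and all names (`expensive_objective`, `fit_local_quadratic`, `surrogate_search`) are hypothetical; this is not the surrogate-based CMA-ES mentioned in this section.

```python
import random


def expensive_objective(x):
    """Stand-in for a costly simulation (hypothetical toy function)."""
    return (x - 1.5) ** 2 + 0.5


def fit_local_quadratic(archive, centre):
    """Quadratic interpolating the 3 archived (x, y) points nearest to `centre`."""
    pts = sorted(archive, key=lambda p: abs(p[0] - centre))[:3]

    def surrogate(x):
        # Lagrange form of the interpolating polynomial.
        total = 0.0
        for i, (xi, yi) in enumerate(pts):
            weight = 1.0
            for j, (xj, _) in enumerate(pts):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total

    return surrogate


def surrogate_search(objective, lower=-5.0, upper=5.0, n_true_evals=12,
                     n_candidates=200, seed=0):
    """Spend `n_true_evals` true evaluations, guided by a cheap surrogate."""
    rng = random.Random(seed)
    # Initial design: three true evaluations spread over the domain.
    archive = [(x, objective(x)) for x in (lower, 0.5 * (lower + upper), upper)]
    while len(archive) < n_true_evals:
        best_x, _ = min(archive, key=lambda p: p[1])
        # Model update: here the surrogate is refitted after every true evaluation.
        surrogate = fit_local_quadratic(archive, best_x)
        # Inner optimisation runs on the surrogate only, at negligible cost.
        candidates = [rng.uniform(lower, upper) for _ in range(n_candidates)]
        x_new = min(candidates, key=surrogate)
        if any(x_new == x for x, _ in archive):
            continue  # avoid duplicate abscissae in the interpolation
        archive.append((x_new, objective(x_new)))
    return min(archive, key=lambda p: p[1])


best_x, best_y = surrogate_search(expensive_objective)
```

Both issues raised above appear as explicit design choices in the sketch: the model family is fixed in `fit_local_quadratic`, and the update frequency is the point in the loop where the surrogate is refitted.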

Within the ANR/RNTL OMD project, TAO is in charge of the technology transfer related to surrogate methods, motivated by the expensive benchmark problems of the industrial OMD partners (Dassault, Renault and EADS), using in particular the surrogate-based version of CMA-ES (S. Kern, N. Hansen, and P. Koumoutsakos. Local Meta-Models for Optimisation Using Evolution Strategies. In Th. Runarsson, ed., *Proc. PPSN IX*, pp. 939-948, LNCS 4193, Springer Verlag, 2006) as in [Oops!]. In particular, TAO's contribution to OMD includes the port to Scilab of both the original and the surrogate versions of CMA-ES.

#### Approximate Dynamic Programming and Multi-Armed Bandit Problems

In some problems, the goal is to find the optimal policy in the sense of some (delayed) reward function; an intermediate step is thus to learn the value function, which associates with each problem state the expected reward. Dynamic programming, a sound and robust approach dating back to the 1950s, suffers from the curse of dimensionality (W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley and Sons, 2006), and Monte-Carlo planning approaches are being investigated to address this limitation. TAO has been working in both areas.
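
As an illustration of the value function, the following sketch runs tabular value iteration on a hypothetical five-state chain with deterministic moves, a reward of 1 for reaching or staying in the last state, and a discount factor of 0.9. The problem and all names are assumptions chosen for brevity; this is textbook dynamic programming, not OpenDP code.

```python
def value_iteration(n_states, actions, transition, reward, gamma=0.9, tol=1e-9):
    """Tabular value iteration for a deterministic decision problem."""
    values = [0.0] * n_states
    while True:
        # Bellman backup: best immediate reward plus discounted successor value.
        new_values = [
            max(reward(s, a) + gamma * values[transition(s, a)] for a in actions)
            for s in range(n_states)
        ]
        if max(abs(n - v) for n, v in zip(new_values, values)) < tol:
            return new_values
        values = new_values


N_STATES = 5
ACTIONS = (-1, +1)  # move left or right along the chain


def step(state, action):
    """Deterministic transition, clipped to the ends of the chain."""
    return min(max(state + action, 0), N_STATES - 1)


def goal_reward(state, action):
    """Reward 1 when the move ends in the last state, 0 otherwise."""
    return 1.0 if step(state, action) == N_STATES - 1 else 0.0


values = value_iteration(N_STATES, ACTIONS, step, goal_reward)
# Greedy policy with respect to the learned value function.
policy = [
    max(ACTIONS, key=lambda a, s=s: goal_reward(s, a) + 0.9 * values[step(s, a)])
    for s in range(N_STATES)
]
```

The table `values` holds one entry per state, so its size grows exponentially with the number of state variables; this is the curse of dimensionality that approximate and sampling-based methods aim to mitigate.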

OpenDP (Sylvain Gelly and Olivier Teytaud. Opendp: a free reinforcement learning toolbox for discrete time control problems. In *NIPS Workshop on Machine Learning Open Source Software*, 2006) is an open source platform for approximate dynamic programming (http://opendp.sourceforge.net), which has been used for a thorough benchmark of diverse sampling, learning, optimisation and frugal non-linear programming algorithms. Experimental comparisons have been reported in Sylvain Gelly's PhD [Oops!], together with theoretical results related to deterministic, random and quasi-random sampling [Oops!], [Oops!].

Monte-Carlo planning approaches have been investigated, with computer Go as the motivating application. The MoGo program [Oops!], [Oops!], [Oops!], which embeds a Monte-Carlo evaluation function within the Multi-Armed Bandit framework, is currently the best computer-Go program (http://www.lri.fr/~gelly/MoGo.htm). See also sections 2.3 and 5.1.
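
To give an idea of how Monte-Carlo evaluation and bandits combine, the sketch below applies the standard UCB1 rule to a set of candidate moves, each evaluated by simulated playouts. The playout is replaced here by a Bernoulli draw with a fixed win probability per move; this stand-in, the function name `ucb1` and the parameter values are assumptions for illustration, and the code is not taken from MoGo.

```python
import math
import random


def ucb1(win_probabilities, n_rounds=5000, seed=0):
    """Allocate `n_rounds` simulated playouts among moves with the UCB1 rule."""
    rng = random.Random(seed)
    n_arms = len(win_probabilities)
    counts = [0] * n_arms   # number of playouts per move
    wins = [0.0] * n_arms   # number of won playouts per move
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # try every move once
        else:
            # Empirical win rate plus an exploration bonus that shrinks
            # as the move gets simulated more often.
            arm = max(
                range(n_arms),
                key=lambda a: wins[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        # Stand-in for a Monte-Carlo playout: win with the move's probability.
        outcome = 1.0 if rng.random() < win_probabilities[arm] else 0.0
        counts[arm] += 1
        wins[arm] += outcome
    return counts


counts = ucb1([0.2, 0.5, 0.8])
```

Most playouts should end up on the move with the highest win probability, while weaker moves are still sampled occasionally to keep their estimates reliable.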

The Multi-Armed Bandit (MAB) setting has been intensively studied in TAO from both a theoretical [Oops!] and an applied [Oops!] perspective. The MAB extension to dynamic settings has been considered in relation with the Pascal Challenge *Online Trading of Exploration vs Exploitation* (http://www.pascal-network.org/Challenges/EEC/); TAO won the challenge in 2006, which highlighted the successful use of change point detection techniques [Oops!] (Cédric Hartland, Sylvain Gelly, Nicolas Baskiotis, Olivier Teytaud, and Michèle Sebag. Multi-armed bandit, dynamic environments and meta-bandits, *Online Trading of Exploration and Exploitation Workshop, NIPS*, 2006). Directions for further research are to extend the Multi-Armed Bandit algorithms underlying MoGo, and to address multi-variate, multi-objective bandit problems.
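
The text above does not say which change point detection technique was used. As one possible illustration, the sketch below implements a Page-Hinkley style test that signals a drop in the mean reward of an arm; the choice of this test, the name `page_hinkley_drop` and the values of `delta` and `threshold` are assumptions made for the example.

```python
def page_hinkley_drop(rewards, delta=0.005, threshold=2.0):
    """Return the first index at which a drop in mean reward is signalled, else None."""
    total = 0.0        # running sum, used for the running mean
    cumulative = 0.0   # cumulative deviation from the running mean
    maximum = 0.0      # largest cumulative deviation seen so far
    for t, r in enumerate(rewards, start=1):
        total += r
        running_mean = total / t
        # `delta` is a tolerance that absorbs small fluctuations.
        cumulative += r - running_mean + delta
        maximum = max(maximum, cumulative)
        # A large fall below the past maximum indicates that rewards dropped.
        if maximum - cumulative > threshold:
            return t - 1
    return None


# Hypothetical reward stream: the mean falls from 0.8 to 0.2 at index 100.
stream = [0.8] * 100 + [0.2] * 100
alarm = page_hinkley_drop(stream)
```

In a dynamic bandit, such an alarm could be used to reset the statistics of the affected arm so that exploration resumes after the environment changes; with noisy rewards, `threshold` trades detection delay against false alarms.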