## Section: Research Program

### Developing Novel Theoretical Frameworks for Analyzing and Designing Adaptive Stochastic Algorithms

Stochastic black-box algorithms typically optimize non-convex, non-smooth functions. This is possible because the algorithms rely only on weak mathematical properties of the underlying functions: they do not use derivatives, hence the function does not need to be differentiable, and, additionally, they often do not use the exact function values but only how the objective function ranks the candidate solutions (such methods are sometimes called function-value-free). (To illustrate a comparison-based update, consider an algorithm that samples $\lambda$ candidate solutions (with $\lambda$ an even integer) from a multivariate normal distribution. Let $x_1, \ldots, x_\lambda \in \mathbb{R}^n$ denote those $\lambda$ candidate solutions at a given iteration. The solutions are evaluated on the function $f$ to be minimized and ranked from best to worst:

$f(x_{1:\lambda}) \le \cdots \le f(x_{\lambda:\lambda}) \, .$

In the previous equation, $i:\lambda$ denotes the index of the sampled solution associated with the $i$-th best solution. The mean of the Gaussian vector from which new solutions will be sampled at the next iteration can be updated as

$m \leftarrow \frac{2}{\lambda} \sum_{i=1}^{\lambda/2} x_{i:\lambda} \, .$

The previous update moves the mean towards the $\lambda/2$ best solutions. Yet the update is based only on the ranking of the candidate solutions, so it is the same whether we optimize $f$ or $g \circ f$, where $g: \mathrm{Im}(f) \to \mathbb{R}$ is strictly increasing. Consequently, such algorithms are invariant with respect to strictly increasing transformations of the objective function, which makes them robust and allows their performance to generalize well.)
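As a concrete illustration, the comparison-based update above can be sketched in a few lines of Python; the sphere function and the exponential transformation below are illustrative choices of $f$ and $g$ made for this sketch, not part of the original text:

```python
import numpy as np

def comparison_based_step(m, sigma, f, lam=10, rng=None):
    """One iteration: sample lam candidates, rank them by f,
    and move the mean to the average of the lam/2 best."""
    rng = np.random.default_rng(0) if rng is None else rng
    X = m + sigma * rng.standard_normal((lam, len(m)))  # candidate solutions
    ranked = X[np.argsort([f(x) for x in X])]           # best solutions first
    return ranked[: lam // 2].mean(axis=0)              # new mean

sphere = lambda x: float(x @ x)                # f(x) = ||x||^2
g_of_f = lambda x: np.exp(sphere(x)) - 1.0    # g o f with g strictly increasing

m0 = np.ones(3)
m_f  = comparison_based_step(m0, 1.0, sphere, rng=np.random.default_rng(42))
m_gf = comparison_based_step(m0, 1.0, g_of_f, rng=np.random.default_rng(42))
assert np.allclose(m_f, m_gf)  # identical updates: invariance to g
```

With the same random samples, optimizing $f$ and $g \circ f$ produces the exact same ranking and hence the exact same mean update.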

Additionally, adaptive stochastic optimization algorithms typically have a complex state space which encodes the parameters of a probability distribution (e.g. the mean and covariance matrix of a Gaussian vector) together with other state vectors. This state space is a manifold. While the algorithms are Markov chains, the complexity of the state space means that standard Markov chain theory tools do not directly apply. The same holds for tools stemming from stochastic approximation theory or Ordinary Differential Equation (ODE) theory, where it is usually assumed that the underlying ODE (obtained by proper averaging and by taking the limit of the learning rate to zero) has its critical points inside the search space. In contrast, in the cases we are interested in, the critical points of the ODEs lie at the boundary of the domain.
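To make this state space concrete, here is a minimal Python sketch (an editor's illustration, not code from the project) of the state of a CMA-like algorithm, with the constraints that define the manifold $\mathbb{R}^n \times \mathbb{R}_{>0} \times S_{++}^n$:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class State:
    """State of a CMA-like algorithm: a point in R^n x R_{>0} x S^n_{++}."""
    mean: np.ndarray   # m in R^n
    sigma: float       # step-size, must stay > 0
    cov: np.ndarray    # covariance matrix, must stay positive definite

    def is_valid(self):
        if self.sigma <= 0:
            return False
        try:
            np.linalg.cholesky(self.cov)  # succeeds iff cov is positive definite
            return True
        except np.linalg.LinAlgError:
            return False

assert State(np.zeros(2), 1.0, np.eye(2)).is_valid()
# sigma = 0 lies on the boundary of the domain, precisely where
# the limiting ODEs discussed below have critical points
assert not State(np.zeros(2), 0.0, np.eye(2)).is_valid()
```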

Last, since we aim at developing theory that, on the one hand, allows us to analyze the main properties of state-of-the-art methods and, on the other hand, is useful for algorithm design, we need to be careful not to rely on simplifications that would make a proof possible but would fail to capture the important properties of the algorithms. In that respect, one tricky point is to develop theory that accounts for invariance properties.

To address those specific challenges, we need to develop novel theoretical frameworks that exploit invariance properties and account for those peculiar state spaces. Those frameworks should allow researchers to analyze one of the core properties of adaptive stochastic methods, namely linear convergence, on the widest possible class of functions.

We plan to approach the question of linear convergence from three complementary angles, using three different frameworks:

• The Markov chain framework, where convergence derives from the stability analysis of a normalized Markov chain that exists, on scaling-invariant functions, for translation- and scale-invariant algorithms. This framework allows for a fine analysis where the exact convergence rate can be given as an implicit function of the invariant measure of the normalized Markov chain. Yet it requires the objective function to be scaling-invariant. The stability analysis can be particularly tricky, as the Markov chain to be studied can be written as $\Phi_{t+1} = F(\Phi_t, W_{t+1})$, where $\{W_t : t > 0\}$ are independent and identically distributed and $F$ is typically discontinuous because the algorithms studied are comparison-based. This implies that practical tools for analyzing a standard property like irreducibility, which rely on investigating the stability of underlying deterministic control models, cannot be used. Additionally, the construction of a drift function to prove ergodicity is particularly delicate when the state space includes a (normalized) covariance matrix, as is the case when analyzing the CMA-ES algorithm.

• The stochastic approximation or ODE framework. These are standard techniques to prove the convergence of stochastic algorithms when an algorithm can be expressed as a stochastic approximation of the solution of a mean-field ODE. What is specific, and induces difficulties for the algorithms we aim at analyzing, is the non-standard state space, since the ODE variables correspond to the state variables of the algorithm (e.g. $\mathbb{R}^n \times \mathbb{R}_{>0}$ for step-size adaptive algorithms, or $\mathbb{R}^n \times \mathbb{R}_{>0} \times S_{++}^n$, where $S_{++}^n$ denotes the set of positive definite matrices, if a covariance matrix is additionally adapted). Consequently, the ODE can have many critical points at the boundary of its definition domain (e.g. all points with $\sigma_t = 0$ are critical points of the ODE), which is not typical. Also, since we aim at proving linear convergence, it is crucial that the learning rate does not decrease to zero, which is non-standard in the ODE method.

• The direct framework, where we construct a global Lyapunov function for the original algorithm, from which we deduce bounds on the hitting time to reach an $\epsilon$-ball around the optimum. For this framework, as for the ODE framework, we expect that the class of functions on which we can prove linear convergence consists of composites $g \circ f$ where $f$ is differentiable and $g: \mathrm{Im}(f) \to \mathbb{R}$ is strictly increasing, and that we can show convergence to a local minimum.
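As a numerical illustration of the normalized-chain idea behind the first framework, the sketch below runs a (1+1)-ES with a one-fifth success rule on the sphere function, a scaling-invariant function. The algorithm and its constants are the editor's illustrative choices, not the exact algorithms analyzed here; on this function, $\log \|x_t\|$ decreases roughly linearly while the normalized quantity $\|x_t\|/\sigma_t$ remains well defined.

```python
import numpy as np

def one_plus_one_es(n=5, iters=2000, seed=1):
    """(1+1)-ES with a 1/5th success rule on the sphere f(x) = ||x||^2.
    Returns log ||x_t|| over time and the final normalized state ||x||/sigma."""
    rng = np.random.default_rng(seed)
    x = np.ones(n)
    sigma, fx = 1.0, float(x @ x)
    log_dist = []
    for _ in range(iters):
        y = x + sigma * rng.standard_normal(n)
        fy = float(y @ y)
        if fy <= fx:                 # success: accept and increase step-size
            x, fx = y, fy
            sigma *= np.exp(0.2)
        else:                        # failure: decrease step-size
            sigma *= np.exp(-0.05)   # stationary at a 1/5 success rate
        log_dist.append(np.log(np.linalg.norm(x)))
    return np.array(log_dist), np.linalg.norm(x) / sigma

log_dist, z_final = one_plus_one_es()
assert log_dist[-1] < log_dist[0]  # log-distance to the optimum decreases
assert np.isfinite(z_final) and z_final > 0
```

The quantity $\|x_t\|/\sigma_t$ is exactly the kind of normalized chain whose stability, in the Markov chain framework, yields the linear convergence rate.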

We expect those frameworks to be complementary in the sense that they require different assumptions. Typically, the ODE framework should allow for proofs under the assumption that the learning rates are small enough, whereas the Markov chain framework does not need this assumption; hence the latter framework better captures the real dynamics of the algorithm, yet under the assumption of scaling-invariance of the objective function. Also, we expect some overlap in the function classes that can be studied with the different frameworks (typically, convex-quadratic functions should be covered by all three). By studying the different frameworks in parallel, we expect to gain synergies and possibly understand which is the most promising approach for solving the holy-grail question of the linear convergence of CMA-ES. We foresee, for instance, that similar tools, like Foster-Lyapunov drift conditions, are needed in all the frameworks and that intuition on how to establish the conditions can be carried over from one framework to another.
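A Foster-Lyapunov drift condition can be checked numerically on a toy example; the chain $S_{t+1} = \sqrt{S_t}\, e^{W_{t+1}}$ and the drift function $V(s) = s + 1/s$ below are the editor's hypothetical choices, far simpler than the chains arising from CMA-ES, but they show the shape of the inequality involved.

```python
import numpy as np

def drift_check(s, gamma=0.8, b=3.0, n_samples=20000, seed=0):
    """Monte Carlo check of the drift inequality
        E[V(S_{t+1}) | S_t = s] <= gamma * V(s) + b
    for the toy chain S_{t+1} = sqrt(S_t) * exp(W), W ~ N(0, 0.1^2),
    with drift function V(s) = s + 1/s."""
    rng = np.random.default_rng(seed)
    V = lambda s: s + 1.0 / s
    w = rng.normal(0.0, 0.1, n_samples)
    s_next = np.sqrt(s) * np.exp(w)         # samples of S_{t+1} given S_t = s
    return V(s_next).mean() <= gamma * V(s) + b

# the inequality holds both far from and inside the chain's typical range
assert all(drift_check(s) for s in (0.01, 1.0, 100.0))
```

Such an inequality, with $\gamma < 1$, forces the chain back towards the center of the state space and is a standard route to ergodicity, here verified only empirically and on a deliberately simple chain.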