Section: New Results
Deep Learning and Information Theory

Natural Gradients for Deep Learning. Deep learning is now established as a state-of-the-art technology for tasks such as image or sequence processing. Nevertheless, much of the computational burden is spent tuning hyperparameters. Ongoing work, started during the TIMCO project, proposes, in the framework of Riemannian gradient descent, invariant algorithms for training neural networks that effectively reduce the number of arbitrary choices, e.g., affine transformations of the activation functions or shuffling of the inputs. Moreover, the Riemannian gradient descent algorithms perform as well as state-of-the-art optimizers for neural networks, and are even faster for training complex models. The proposed approach is based on Amari's theory of information geometry and consists of practical, well-grounded approximations for computing the Fisher metric. The scope of this framework is larger than deep learning and encompasses any class of statistical models.
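The idea of preconditioning the gradient by the inverse Fisher metric can be illustrated on the simplest exponential family. The sketch below (an illustrative toy, not the algorithms of the project) fits a Bernoulli probability p = sigmoid(theta); in this parametrization the Fisher information is F(theta) = p(1 - p), and dividing the ordinary gradient by F yields the natural gradient step:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_bernoulli(data, natural=True, lr=0.1, steps=200):
    """Fit a Bernoulli model p = sigmoid(theta) by (natural) gradient ascent.

    In the logit parametrization the Fisher information of the Bernoulli
    family is F(theta) = p (1 - p); the natural gradient preconditions
    the ordinary gradient by F^{-1}, making the update invariant under
    smooth reparametrizations of theta.
    """
    theta = 0.0
    for _ in range(steps):
        p = sigmoid(theta)
        grad = np.mean(data) - p            # d/dtheta of average log-likelihood
        if natural:
            grad /= max(p * (1 - p), 1e-8)  # Fisher preconditioning
        theta += lr * grad
    return sigmoid(theta)

rng = np.random.default_rng(0)
data = (rng.random(1000) < 0.3).astype(float)  # Bernoulli(0.3) samples
p_hat = fit_bernoulli(data)
```

The fitted probability converges to the empirical frequency, the maximum-likelihood estimate, regardless of how the parameter is encoded; this invariance is what removes arbitrary modeling choices from the tuning loop.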

Training dynamical systems online without backtracking with application to recurrent neural networks [73]. We propose an algorithm to learn the parameters of a dynamical system in an online, memoryless setting. It requires no backpropagation through time and is consequently scalable, avoiding the large computational and memory cost of maintaining the full gradient of the current state with respect to the parameters. The algorithm essentially maintains, at each time step, a single search direction in parameter space. The evolution of this search direction is partly stochastic and is constructed so as to provide, at every time step, an unbiased random estimate of the gradient of the loss function with respect to the parameters.
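The key device that keeps a single search direction while preserving unbiasedness can be shown in isolation. The sketch below (a toy illustration of the rank-one reduction idea, not the full algorithm of [73]) collapses the sum of two rank-one matrices into one rank-one product using a random sign, so that the cross terms vanish in expectation:

```python
import numpy as np

def rank_one_reduce(v1, w1, v2, w2, rng):
    """Collapse v1 w1^T + v2 w2^T into a single rank-one product.

    With a uniform random sign s = +/-1, the product
    (v1 + s v2)(w1 + s w2)^T equals v1 w1^T + v2 w2^T in expectation:
    the cross terms carry a factor s of mean zero, and s^2 = 1.
    Only one pair of vectors needs to be stored, so the 'search
    direction' stays rank one while the estimate remains unbiased.
    """
    s = rng.choice([-1.0, 1.0])
    return v1 + s * v2, w1 + s * w2

rng = np.random.default_rng(1)
v1, w1 = rng.standard_normal(3), rng.standard_normal(4)
v2, w2 = rng.standard_normal(3), rng.standard_normal(4)
exact = np.outer(v1, w1) + np.outer(v2, w2)

# Monte Carlo check of unbiasedness.
n = 20000
acc = np.zeros((3, 4))
for _ in range(n):
    v, w = rank_one_reduce(v1, w1, v2, w2, rng)
    acc += np.outer(v, w)
err = np.abs(acc / n - exact).max()
```

Applied at every time step of the recurrence, this reduction keeps the memory cost linear in the number of parameters, instead of the quadratic cost of storing the full state-parameter Jacobian.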

Approximating Bayesian predictors thanks to Laplace's rule of succession. Laplace's "add-one" rule of succession modifies the observed frequencies in a sequence of heads and tails by adding one to the observed counts. This improves prediction by avoiding zero probabilities and corresponds to a uniform Bayesian prior on the parameter. We prove that, for any exponential family of distributions, arbitrary Bayesian predictors can be approximated by taking the average of the maximum likelihood predictor and the sequential normalized maximum likelihood predictor from information theory, which generalizes Laplace's rule. The proof heavily involves the geometry provided by the Fisher information matrix. Thus it is possible to approximate Bayesian predictors without the cost of integrating or sampling in parameter space [46].
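The coin-flip case makes the correspondence between the add-one rule and the uniform prior concrete. A minimal check, assuming the standard Beta-Bernoulli conjugacy: after k heads and m tails, the posterior under a uniform Beta(1, 1) prior is Beta(k + 1, m + 1), whose mean (k + 1)/(k + m + 2) is exactly Laplace's rule.

```python
from fractions import Fraction

def laplace_rule(heads, tails):
    """Laplace's add-one rule: P(next = heads) = (k + 1) / (n + 2)."""
    return Fraction(heads + 1, heads + tails + 2)

def uniform_prior_predictive(heads, tails):
    """Posterior predictive for heads under a uniform Beta(1, 1) prior.

    The posterior after k heads and m tails is Beta(k + 1, m + 1);
    its mean is the predictive probability of heads.
    """
    a, b = heads + 1, tails + 1  # Beta posterior parameters
    return Fraction(a, a + b)    # posterior mean

# The two predictors coincide on every observed count.
for k, m in [(0, 0), (3, 1), (7, 5)]:
    assert laplace_rule(k, m) == uniform_prior_predictive(k, m)
```

Note how the rule assigns probability 1/2 before any observation and never outputs zero, unlike the maximum likelihood predictor k/(k + m).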