Section: New Results
Learning to Learn
Auto*
Participants: Guillaume Charpiat, Isabelle Guyon, Marc Schoenauer, Michèle Sebag
PhDs: Léonard Blier, Guillaume Doquet, Zhengying Liu, Herilalaina Rakotoarison, Lisheng Sun, Pierre Wolinski
Collaboration: Vincent Renault (SME Artelys); Yann Ollivier (Facebook)
The Auto* studies at Tau investigate several research directions.
As mentioned in Section 3.3, a popular approach to algorithm selection is collaborative filtering. In Lisheng Sun's PhD [15], active learning was used on top of the CofiRank matrix-factorization algorithm [164], improving both the results and the time-to-solution of the recommendation algorithm. Furthermore, most real-world domains evolve with time, and an important issue in real-world applications is lifelong learning, as static models can rapidly become obsolete. Another contribution of Lisheng Sun's PhD is an extension of AutoSklearn that detects concept drift and corrects the current model accordingly.
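The collaborative-filtering view can be sketched as follows: rows of a "performance matrix" are datasets, columns are algorithms, entries are observed performances, and a low-rank factorization predicts the missing entries so that unseen dataset/algorithm pairs can be recommended. The sketch below is a deliberately simplified stand-in (plain gradient-descent matrix factorization on synthetic data, not CofiRank, and not the active-learning layer of [15]); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "performance matrix": 20 datasets x 10 algorithms, true rank 2.
U_true = rng.normal(size=(20, 2))
V_true = rng.normal(size=(10, 2))
P = U_true @ V_true.T

# Observe only ~60% of the entries (each dataset was only run with some algorithms).
mask = rng.random(P.shape) < 0.6

# Rank-2 factorization fitted by gradient descent on the observed entries only.
k = 2
U = 0.1 * rng.normal(size=(20, k))
V = 0.1 * rng.normal(size=(10, k))
lr, reg = 0.02, 1e-3
for _ in range(2000):
    E = mask * (U @ V.T - P)          # error restricted to observed entries
    U, V = U - lr * (E @ V + reg * U), V - lr * (E.T @ U + reg * V)

# Recommend, for each dataset, the algorithm with the highest predicted score.
pred = U @ V.T
best_algo = pred.argmax(axis=1)
rmse_obs = np.sqrt(((mask * (pred - P)) ** 2).sum() / mask.sum())
rmse_missing = np.sqrt((((~mask) * (pred - P)) ** 2).sum() / (~mask).sum())
```

The point of the exercise is that the unobserved entries are recovered well despite never being evaluated, which is what makes recommendation cheaper than exhaustive benchmarking.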
An original approach to Auto*, explored in Herilalaina Rakotoarison's PhD, extends and adapts Monte-Carlo Tree Search to explore the structured space of preprocessing + learning algorithm configurations and gradually determine the best pipeline [40]; the resulting Mosaic algorithm performs on par with AutoSklearn, the winner of the Auto* international competitions of the last few years.
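A toy illustration of the idea: the pipeline space is a tree (first a preprocessing step, then a learning algorithm), and a UCB-style tree search concentrates its noisy evaluations on the most promising branch. This is a minimal UCT sketch on a fabricated two-level space with a hypothetical "true" score table, not the Mosaic algorithm itself.

```python
import math, random

random.seed(0)

# Toy structured search space: a preprocessing step, then a learning algorithm.
PREPROC = ["none", "pca", "scale"]
CLF = ["knn", "svm", "tree"]

# Hypothetical "true" cross-validation scores (unknown to the search).
TRUE = {("scale", "svm"): 0.9, ("pca", "knn"): 0.8}

def evaluate(pipeline):
    """Noisy reward, standing in for one cross-validation run of the pipeline."""
    return TRUE.get(pipeline, 0.6) + random.gauss(0, 0.05)

N, W = {}, {}   # per-node visit counts and cumulative rewards (node = pipeline prefix)

def ucb(node, parent_visits, c=0.5):
    if N.get(node, 0) == 0:
        return float("inf")           # unvisited children are tried first
    return W[node] / N[node] + c * math.sqrt(math.log(parent_visits) / N[node])

for it in range(1, 401):
    # Selection: descend the two levels of the tree greedily w.r.t. UCB.
    p = max(PREPROC, key=lambda a: ucb((a,), it))
    clf = max(CLF, key=lambda a: ucb((p, a), N.get((p,), 0) + 1))
    leaf = (p, clf)
    # Evaluation + backpropagation of the reward along the visited path.
    r = evaluate(leaf)
    for node in ((p,), leaf):
        N[node] = N.get(node, 0) + 1
        W[node] = W.get(node, 0.0) + r

best = max((n for n in N if len(n) == 2), key=lambda n: W[n] / N[n])
```

After a few hundred simulated evaluations the search settles on the best-scoring pipeline while spending only a small fraction of its budget on clearly inferior branches.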
Auto* would be much easier if appropriate and affordable meta-features (describing datasets) were available. Taking inspiration from equivariant learning [144] and learning from distributions [130], ongoing work aims to learn such meta-features based on the OpenML archive [161].
A key building block of Auto* lies in data preparation, and specifically in variable selection. Guillaume Doquet's PhD [13] addresses the problem of agnostic feature selection, independently of any target variable. The highlight of this work is AgnoS [32] (Best Paper Award at ECML 2019), which combines an auto-encoder with structural regularizations to sidestep the combinatorial optimization problem at the core of feature selection. The extensive experimental validation of AgnoS on the scikit-feature benchmark suite demonstrates its merits compared to the state of the art, both in terms of supervised learning and of data compression.
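The mechanism can be illustrated with a deliberately simplified variant of this idea (a linear auto-encoder with a group-sparsity L2,1 penalty on the encoder rows, one group per input feature, on synthetic data; the actual AgnoS architecture and regularizers differ). Features whose encoder row survives the penalty are the ones needed to reconstruct the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 6 features driven by 2 latent factors; features 4 and 5 are noise.
n = 200
z = rng.normal(size=(n, 2))
X = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + z[:, 1], z[:, 0] - z[:, 1],
                     0.1 * rng.normal(size=n), 0.1 * rng.normal(size=n)])
X -= X.mean(axis=0)

d, k, lam, lr = 6, 2, 0.05, 0.01
W1 = 0.1 * rng.normal(size=(d, k))   # encoder
W2 = 0.1 * rng.normal(size=(k, d))   # decoder

for _ in range(3000):
    H = X @ W1
    R = H @ W2 - X                    # reconstruction residual
    gW2 = H.T @ R / n
    gW1 = X.T @ (R @ W2.T) / n
    # L2,1 penalty: each input feature's encoder row is one sparsity group.
    norms = np.linalg.norm(W1, axis=1, keepdims=True) + 1e-8
    W1 -= lr * (gW1 + lam * W1 / norms)
    W2 -= lr * gW2

# Feature importance = norm of the corresponding encoder row.
importance = np.linalg.norm(W1, axis=1)
```

The group penalty drives the rows of the two pure-noise features to (near) zero, while a subset of the informative, mutually redundant features keeps a large norm — feature selection without ever looking at a target variable.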
Several works have focused on the adjustment of specific hyper-parameters of neural nets. Pierre Wolinski's PhD (to be defended in January 2020, publication submitted) studies three such hyper-parameters: i) the network width (number of neurons in each layer); ii) the weight of the regularizer in the objective function (the factor balancing the data term and the regularization term); and iii) the learning rate. The network width is adjusted during training thanks to a criterion quantifying each neuron's importance, naturally leading to a sparsification effect (as with L1-norm minimization). This approach extends beyond layer widths to layer connectivity (e.g., in modern networks where each layer may be connected to any other layer through skip connections). The regularizer weight is formulated as a probabilistic prior from a Bayesian perspective, which yields the particular value it should take for the network to satisfy a given property. Regarding the learning rate, Pierre Wolinski and Léonard Blier [27] proposed to attach a fixed learning rate, drawn at random, to each neuron, and to calibrate the learning-rate distribution so that neurons become active sequentially: learning in an optimally agile manner during a first phase, and remaining stable in later phases. This removes the need to tune a global learning rate.
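The last idea can be sketched in a few lines: each hidden neuron receives its own fixed learning rate, drawn log-uniformly, so that fast neurons converge early while slow neurons keep adapting. This is a toy regression illustration of the per-neuron learning-rate mechanism, not the calibrated distribution of [27]; all sizes and ranges are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression task: y = sin(3x), fitted by a one-hidden-layer tanh network.
X = rng.uniform(-1, 1, size=(256, 1))
y = np.sin(3 * X)

h = 64
W1 = rng.normal(size=(1, h)); b1 = np.zeros(h)
W2 = rng.normal(size=(h, 1)) / np.sqrt(h)

# One fixed learning rate per hidden neuron, drawn log-uniformly.
lrs = 10 ** rng.uniform(-3, -1.5, size=h)

def loss():
    return float(np.mean((np.tanh(X @ W1 + b1) @ W2 - y) ** 2))

loss0 = loss()
for _ in range(2000):
    A = np.tanh(X @ W1 + b1)           # (n, h) hidden activations
    E = (A @ W2 - y) / len(X)          # (n, 1) scaled output error
    gW2 = A.T @ E
    dA = (E @ W2.T) * (1 - A ** 2)     # backprop through tanh
    gW1, gb1 = X.T @ dA, dA.sum(axis=0)
    # Per-neuron learning rates: each hidden unit's parameters use its own lr.
    W1 -= gW1 * lrs
    b1 -= gb1 * lrs
    W2 -= gW2 * lrs[:, None]

loss1 = loss()
```

No global learning rate is tuned: the spread of per-neuron rates guarantees that some neurons always sit in a useful learning regime.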
A last direction of investigation concerns the design of challenges, which contribute to the collective progress of research in the Auto* direction. The team has been very active in the AutoML challenge series [154], and steadily contributes to the organization of new challenges (Section 7.6).
Deep Learning: Practical and Theoretical Insights
Participants: Guillaume Charpiat, Marc Schoenauer, Michèle Sebag
PhDs: Léonard Blier, Corentin Tallec
Collaboration: Yann Ollivier (Facebook AI Research, Paris), the Altschuler and Wu lab. (UCSF, USA), Y. Tarabalka (Inria Titane)
Although a comprehensive mathematical theory of deep learning is yet to come, theoretical insights from information theory or from dynamical systems can deliver principled improvements to deep learning and/or explain the empirical successes of some architectures compared to others.
In his PhD [16], Corentin Tallec presents several contributions along these lines:

In [158], it is shown that the LSTM structure can be derived from axiomatic principles, namely the robustness of the learned model to temporal deformations (warpings) of the data. The LSTM architecture, introduced in the 1990s, has become the dominant architecture for modeling temporal sequences (such as text) in deep learning; its seemingly complex equations necessarily arise, and can be derived axiomatically, once the model is required to handle time warpings of the data, such as arbitrary accelerations or decelerations of the signal.
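In simplified form (the notation here is ours, not the paper's), the argument runs as follows. Suppose the hidden state follows an underlying differential equation $\frac{dh}{dt} = \Phi(h(t), x(t))$. Requiring the model class to be invariant under an unknown time warping $t \mapsto c(t)$ turns the dynamics into $\frac{dh}{dt} = c'(t)\,\Phi(h(t), x(t))$, where $c'(t)$ must be learned from the data. An Euler discretization then yields
$$h_{t+1} = h_t + g_t \odot \Phi(h_t, x_t),$$
where the learnable gate $g_t \approx c'(t)$ plays exactly the role of the LSTM gates; writing $\Phi(h,x) = \tanh(Wx + Uh + b) - h$ recovers the leaky update
$$h_{t+1} = (1-g_t)\odot h_t + g_t \odot \tanh(Wx_t + Uh_t + b),$$
i.e., the forget-gate mechanism.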

In [132] (oral presentation at ICML 2018), the issue of mode dropping in adversarial generative models is tackled using information theory. The adversary (discriminator) is tasked with predicting the proportion of true and fake images in a set of images, via an information-theoretic criterion, thus working at the level of the overall distribution of images. The discriminator is thereby better able to detect statistical imbalances among the modes created by the generator, reducing the mode-dropping phenomenon. The proposed architecture, inspired by equivariant approaches, is provably able to detect all permutation-invariant statistics on a set of images.
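The basic building block behind such set-level discriminators can be illustrated as follows: per-element features followed by mean-pooling over the set produce a summary that is invariant to the ordering of the set's elements. This is only the pooling skeleton (random features on toy data), not the full architecture of [132].

```python
import numpy as np

rng = np.random.default_rng(3)

def set_features(batch, W):
    """Permutation-invariant summary of a set of samples:
    per-element features, then mean-pooling over the set dimension."""
    return np.tanh(batch @ W).mean(axis=0)

# A "set" of 8 flattened toy images (dim 16) and a random feature map.
batch = rng.normal(size=(8, 16))
W = rng.normal(size=(16, 4))

out = set_features(batch, W)
out_perm = set_features(batch[rng.permutation(8)], W)   # same set, shuffled order
```

Because the summary depends only on the pooled statistics, any classifier stacked on top of it sees the batch as a distribution rather than as an ordered list of samples.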

In [159], the training of recurrent networks is tackled via the theory of dynamical systems, with a simple, fully online solution that avoids the "time rewind" step: it relies on real-time, noisy but unbiased approximations of the model gradients, can easily be implemented in a black-box fashion on top of any recurrent model, and is mathematically well justified. The price to pay is an increased variance of the gradient estimates.

In [42], we identify the sensitivity of deep RL to the time discretization of near continuous-time environments as a critical factor. Empirically, we find that Q-learning-based approaches collapse as the time step becomes small. Formally, we prove that Q-learning does not exist in continuous time. We detail a principled way to build an off-policy RL algorithm that yields similar performance over a wide range of time discretizations, and confirm this robustness empirically.
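The collapse can be sketched as follows (a simplified account of the argument). For a time discretization $\delta t$, the state-action value function expands as
$$Q^\pi_{\delta t}(s,a) = V^\pi(s) + \delta t\, A^\pi(s,a) + o(\delta t),$$
so that as $\delta t \to 0$ the $Q$-function loses its dependence on the action and greedy action selection becomes meaningless. Parameterizing the value $V$ and the advantage $A$ separately, with updates suitably rescaled by $\delta t$, restores an algorithm whose behaviour is essentially invariant to the discretization.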
Several other directions have been investigated. In [41], we introduce a multi-domain adversarial learning algorithm in the semi-supervised setting. We extend the single-source H-divergence theory for domain adaptation to the case of multiple domains, and obtain bounds on the average- and worst-domain risk in multi-domain learning. This leads to a new loss accommodating semi-supervised multi-domain learning and domain adaptation. We obtain state-of-the-art results on two standard image benchmarks, and propose as a new benchmark a novel bio-image dataset, CELL, in the domain of automated microscopy, where cultured cells are imaged after exposure to known and unknown chemical perturbations, and where each dataset displays a significant experimental bias.
Another direction regards the topology induced by a trained neural net: how similar are two samples from the network's perspective? The definition proposed in [29] relies on varying the network parameters and examining whether the impacts of this variation on both samples are aligned. The mathematical properties of this similarity measure are investigated, and the similarity is shown to define a kernel on the input space. This kernel can be used to tractably estimate the sample density, and it opens new directions for the statistical learning analysis of neural networks, e.g., in terms of additional losses (requiring that similar examples have similar latent representations in the above sense) or of resistance to noise. Specifically, a multimodal image registration task is presented where almost perfect accuracy is reached despite a high label noise (see Section 7.5.2). This impressive self-denoising phenomenon can be explained and quantified as a noise-averaging effect over the labels of similar examples.
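The similarity can be sketched concretely: for a scalar-output network, compare the gradients of the output with respect to all parameters at two inputs — if a small weight perturbation moves both outputs together, the samples are "similar" for the network. The sketch below uses a tiny hand-differentiated two-layer net and cosine similarity of parameter gradients; it is an illustration of the general idea, not the exact construction of [29].

```python
import numpy as np

rng = np.random.default_rng(4)

# Small scalar-output network f(x) = w2 . tanh(W1 x), with random weights.
W1 = rng.normal(size=(8, 5))
w2 = rng.normal(size=8)

def param_grad(x):
    """Gradient of f at input x w.r.t. all parameters, flattened."""
    a = np.tanh(W1 @ x)
    gw2 = a                                        # df/dw2
    gW1 = (w2 * (1 - a ** 2))[:, None] * x[None, :]  # df/dW1
    return np.concatenate([gW1.ravel(), gw2])

def similarity(x, y):
    """Cosine of parameter gradients: do weight changes move f(x) and f(y) together?"""
    gx, gy = param_grad(x), param_grad(y)
    return gx @ gy / (np.linalg.norm(gx) * np.linalg.norm(gy))

x = rng.normal(size=5)
s_self = similarity(x, x)                                # 1 by construction
s_near = similarity(x, x + 1e-3 * rng.normal(size=5))    # near-duplicate input
s_far = similarity(x, rng.normal(size=5))                # unrelated input
```

Nearby inputs get a similarity close to 1; unrelated inputs generally do not, which is what makes the measure usable as a kernel for density estimation or for label-noise averaging.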
Analyzing and Learning Complex Systems
Participants: Cyril Furtlehner, Aurélien Decelle, François Landes
PhDs: Giancarlo Fissore
Collaboration: Jacopo Rocchi (LPTMS Paris Sud); the Simons team: Rahul Chako (postdoc), Andrea Liu (UPenn), David Reichman (Columbia), Giulio Biroli (ENS), Olivier Dauchot (ESPCI); Clément Vignac (EPFL); Yufei Han (Symantec).
The information content of a trained restricted Boltzmann machine (RBM) can be analyzed by comparing the singular values/vectors of its weight matrix, referred to as modes, to those of a random RBM (typically following a Marchenko-Pastur distribution) [83]. The analysis of a single learning trajectory is thus replaced by the analysis of the distribution of a well-chosen ensemble of models. In G. Fissore's PhD, the learning trajectory of an RBM is shown to start with a linear phase recovering the dominant modes of the data, followed by a non-linear regime in which the interaction among the modes is characterized [84]. Although simplifying assumptions are required to conduct a mean-field analysis of the above distribution in closed form, the analysis nevertheless delivers simple heuristics to speed up the learning convergence and to simplify the models.
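The random-matrix baseline can be illustrated numerically: the singular values of a purely random weight matrix stay inside the Marchenko-Pastur bulk, whereas a "learned" mode (modeled here as a planted rank-1 spike on top of the random weights — a toy stand-in for a trained RBM) escapes the bulk edge and becomes detectable. Sizes and the spike strength are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

nv, nh = 400, 300                       # visible / hidden units
sigma = 1.0 / np.sqrt(nv)

# Random RBM weights: all singular values fall inside the Marchenko-Pastur bulk.
W_rand = sigma * rng.normal(size=(nv, nh))

# "Trained" RBM, modeled as random weights plus one strong learned mode.
u = rng.normal(size=nv); u /= np.linalg.norm(u)
v = rng.normal(size=nh); v /= np.linalg.norm(v)
W_trained = W_rand + 3.0 * np.outer(u, v)

# MP bulk edge for singular values of an nv x nh matrix with i.i.d. N(0, 1/nv) entries.
edge = 1.0 + np.sqrt(nh / nv)

sv_rand = np.linalg.svd(W_rand, compute_uv=False)
sv_trained = np.linalg.svd(W_trained, compute_uv=False)

# Count "information-carrying" modes: singular values clearly above the bulk edge.
n_modes_rand = int((sv_rand > 1.05 * edge).sum())
n_modes_trained = int((sv_trained > 1.05 * edge).sum())
```

Counting singular values above the bulk edge thus gives a simple, training-free diagnostic of how many modes of the data an RBM has actually stored.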
This analysis will be extended along two directions: handling missing data [58]; and considering exactly solvable RBMs (non-linear RBMs for which the contrastive divergence can be computed in closed form, e.g., using a spherical model) [54]. Regarding missing data, state-of-the-art results have been obtained on semi-supervised tasks in the context of Internet-of-Things security, with a high rate of missing inputs and labels. On the theoretical side, exact generic RBM learning trajectories have been characterized, showing intriguing connections with a Bose-Einstein condensation mechanism associated with information storing. Our collaboration with J. Rocchi (LPTMS, Univ. Paris Sud) aims to characterize the landscape of RBMs learned from different initial conditions, and to relate this landscape to the number of parameters (hidden nodes) of the system.
An emerging research topic concerns the interpretation of deep learning by means of Gaussian processes and the associated neural tangent kernel, obtained in the thermodynamical limit by letting the layer widths go to infinity [118]. On the basis of this theoretical tool, we plan to investigate in particular how this analysis translates to the RBM or DBM setting, and whether a double-descent behaviour is also to be expected for generative models.
As mentioned earlier, the use of ML to address fundamental physics problems is growing quickly, which leads newcomers to methodological mistakes; these have been investigated by Rémi Perrier (2-month internship). One example is the domain of glasses (how the structure of glasses relates to their dynamics), one of the major open problems in modern theoretical physics. The idea is to let ML models automatically find the hidden structures (features) that control the flowing or non-flowing state of matter, discriminating liquid from solid states. These models can then help identify "computational order parameters" that would advance the understanding of the physical phenomena [19], on the one hand, and support the development of more complex models, on the other hand. More generally, attacking the problem of amorphous condensed matter with novel Graph Neural Network (GNN) architectures is a very promising lead, regardless of the precise quantity one wants to predict. Current GNNs are engineered to deal with molecular systems and/or crystals, but not with amorphous matter. This second axis is being pursued in collaboration with Clément Vignac (PhD student at EPFL). Furthermore, this problem is new to the ML community and provides an original, non-trivial example for engineering, testing and benchmarking explainability protocols.