
## Section: New Results

### Data-driven Numerical Modelling

#### High Energy Physics

Participants: Cécile Germain, Isabelle Guyon

Collaboration: D. Rousseau (LAL), M. Pierini (CERN)

The role and limits of simulation in discovery are the subject of V. Estrade’s PhD, specifically uncertainty quantification and calibration: how to handle the systematic errors arising from the differences (“known unknowns”) between simulation and reality, which stem from uncertainty in the so-called nuisance parameters. In the specific context of HEP analysis, where relatively abundant labelled data are available, the problem lies at the crossroads of domain adaptation and representation learning. We have investigated how to directly enforce invariance w.r.t. the nuisance parameters in the sought embedding, either through the learning criterion (tangent back-propagation) or through an adversarial approach (pivotal representation). The results [93] contrast the superior performance of incorporating a priori knowledge on a well-separated-classes problem (MNIST data) with a real-case setting in HEP, related to the Higgs Boson Machine Learning challenge [68] and the TrackML challenge [46]. More indirect approaches, based either on variance reduction for the parameter of interest or on constraining the representation in a variational auto-encoder framework, are currently under consideration.
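The tangent-propagation idea can be illustrated with a minimal sketch (all data, sizes and the nuisance direction are hypothetical, and plain logistic regression stands in for a deep embedding): penalizing the derivative of the classifier score along a known nuisance direction drives the learned decision direction towards invariance w.r.t. that nuisance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the class signal lies along [1, 0]; a nuisance z shifts every
# point along the direction t (the "tangent" of the nuisance).
n = 2000
y = rng.integers(0, 2, n)
signal = np.where(y == 1, 1.0, -1.0)
z = rng.normal(0.0, 1.0, n)
t = np.array([1.0, 1.0]) / np.sqrt(2)
X = np.column_stack([signal, np.zeros(n)]) + np.outer(z, t) \
    + rng.normal(0.0, 0.3, (n, 2))

def fit_logreg(X, y, lam=0.0, lr=0.1, steps=2000):
    """Logistic regression; lam penalizes the score derivative along t."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        gw = X.T @ (p - y) / n
        gb = np.mean(p - y)
        if lam > 0:                      # tangent-penalty gradient
            gw = gw + lam * (w @ t) * t
        w -= lr * gw; b -= lr * gb
    return w

w_plain = fit_logreg(X, y)               # standard training
w_inv = fit_logreg(X, y, lam=10.0)       # tangent-penalized training

# Alignment of the decision direction with the nuisance direction t:
align_plain = abs(w_plain @ t) / np.linalg.norm(w_plain)
align_inv = abs(w_inv @ t) / np.linalg.norm(w_inv)
print(align_inv < align_plain)           # penalized model is less aligned
```

The adversarial (pivotal) variant replaces the analytic penalty with an auxiliary network trained to recover the nuisance from the embedding.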

Anomaly detection (AD) is the subject of A. Pol's PhD. Reliable data quality monitoring is a key asset in delivering collision data suitable for physics analysis in any modern large-scale high energy physics experiment. [21] focuses on supervised and semi-supervised methods addressing the identification of anomalies in the data collected by the CMS muon detectors. The combination of DNN classifiers capable of detecting the known anomalous behaviors with convolutional autoencoders addressing unforeseen failure modes has shown unprecedented efficiency. The result has been included in the production suite of the CMS experiment at CERN. Recent work has focused on improving AD for the trigger system, which is the first stage of the event selection process in most experiments at the LHC at CERN. The hierarchical structure of the trigger process called for exploiting advances in modeling complex structured representations that perform probabilistic inference effectively, specifically variational autoencoders (VAE). Previous works argued that training VAE models only on inliers is insufficient and that the framework should be significantly modified in order to discriminate the anomalous instances. In this work, we exploit the deep conditional variational autoencoder (CVAE) and define an original loss function together with a metric that targets hierarchically structured data AD [39], [64]. This results in an effective, yet easily trainable and maintainable, model.
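The reconstruction-error principle behind autoencoder-based AD can be sketched with a minimal linear stand-in (PCA fitted on inliers only; the actual work uses convolutional and conditional variational autoencoders, and all data here are synthetic): events far from the learned inlier manifold receive a large score.

```python
import numpy as np

rng = np.random.default_rng(1)

# Inlier vectors (e.g. detector-channel summaries) near a 2-D subspace
# of a 10-D space, plus small measurement noise.
basis = rng.normal(size=(2, 10))
inliers = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 10))

# Fit a linear autoencoder (equivalently PCA) on inliers only.
mean = inliers.mean(axis=0)
_, _, Vt = np.linalg.svd(inliers - mean, full_matrices=False)
W = Vt[:2]                                # tied encoder/decoder weights

def anomaly_score(x):
    """Reconstruction error: distance to the learned inlier subspace."""
    r = (x - mean) @ W.T @ W + mean
    return np.linalg.norm(x - r, axis=-1)

normal_evt = rng.normal(size=(1, 2)) @ basis + 0.05 * rng.normal(size=(1, 10))
anomalous_evt = rng.normal(size=(1, 10))  # off-subspace failure mode

print(anomaly_score(normal_evt)[0] < anomaly_score(anomalous_evt)[0])
```

A VAE generalizes this to a nonlinear, probabilistic manifold, and the CVAE additionally conditions the model on the hierarchical position of each subsystem.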

The highly visible TrackML challenge [46] is described in section 7.6.

#### Remote Sensing Imagery

Participants: Guillaume Charpiat

Collaboration: Yuliya Tarabalka, Armand Zampieri, Nicolas Girard, Pierre Alliez (Titane team, Inria Sophia-Antipolis)

The analysis of satellite or aerial images has been a long-standing topic of research, but the remote sensing community moved only very recently to a principled vision of the tasks from a machine learning perspective, with sufficiently large benchmarks for validation. The main topics are the segmentation of (possibly multispectral) remote sensing images into objects of interest, such as buildings, roads, forests, etc., and the detection of changes between two images of the same place taken at different moments. The main differences with classical computer vision are that images are large (covering whole countries, typically cut into $5000×5000$-pixel tiles), containing many small, potentially similar objects (and not one big object per image), that every pixel needs to be annotated (as opposed to assigning a single label to a full image), and that the ground truth is often not reliable (spatially mis-registered, missing new constructions).

This year, we continued the work started in [167] on the registration of remote sensing images (RGB pictures) with cadastral maps (made of polygons indicating buildings and roads). We extended it in [33] to the case of real, i.e. noisy, datasets. Indeed, in remote sensing, datasets are often large but of poor ground-truth annotation quality. It turns out that, when training on datasets with noisy labels, one can still obtain accuracy scores far better than the noise level in the training set would suggest, due to averaging effects over the labels of similar examples. To properly explain this, a theoretical study was conducted (cf. Section 7.2.2). Given any already-trained neural network and its noisy training set, without knowing the real ground truth, we were then able to quantify this noise-averaging effect [29].
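The noise-averaging effect can be illustrated with a toy sketch (a least-squares fit stands in for network training; all numbers are hypothetical): with many noisy labels of similar examples, the fitted model averages the noise away and ends up much closer to the clean labels than any individual annotation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Clean labels plus heavy i.i.d. annotation noise (std 1.0).
n = 5000
x = rng.uniform(-1.0, 1.0, n)
clean = 2.0 * x
noisy = clean + rng.normal(0.0, 1.0, n)

# Least-squares fit on the noisy labels (a stand-in for network training).
w = np.polyfit(x, noisy, 1)
pred = np.polyval(w, x)

err_to_clean = np.mean((pred - clean) ** 2)   # error w.r.t. true labels
noise_var = 1.0
print(err_to_clean < 0.1 * noise_var)         # far below the label-noise variance
```

The predictions average over many noisy labels, so their error with respect to the (unseen) clean labels shrinks roughly with the number of examples per effective degree of freedom.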

We also tackled the problem of pansharpening, i.e. producing a high-resolution color image from a low-resolution color image and a high-resolution greyscale one [35], again with deep convolutional neural networks.
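To make the task concrete, here is a classical component-substitution baseline (not the CNN approach of [35]; sizes and data are illustrative): upsample the low-resolution colour image and inject the high-resolution intensity from the panchromatic band.

```python
import numpy as np

def naive_pansharpen(lr_rgb, hr_gray):
    """Classical baseline: upsample the low-res colour image, then
    replace its intensity with the high-res panchromatic band."""
    scale = hr_gray.shape[0] // lr_rgb.shape[0]
    # Nearest-neighbour upsampling of the colour image.
    up = lr_rgb.repeat(scale, axis=0).repeat(scale, axis=1)
    intensity = up.mean(axis=2, keepdims=True)
    return np.clip(up - intensity + hr_gray[..., None], 0.0, 1.0)

lr = np.random.default_rng(7).uniform(size=(8, 8, 3))    # low-res colour
hr = np.random.default_rng(8).uniform(size=(32, 32))     # high-res grey
out = naive_pansharpen(lr, hr)
print(out.shape)   # (32, 32, 3)
```

A CNN learns the colour/detail fusion end-to-end instead of this fixed substitution rule.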

#### Space Weather Forecasting

Participants: Cyril Furtlehner, Michèle Sebag

PhD: Mandar Chandorkar

Collaboration: Enrico Camporeale (CWI)

Space Weather is broadly defined as the study of the relationships between the variable conditions on the Sun and the space environment surrounding the Earth. Aside from its scientific interest from the point of view of fundamental space physics phenomena, Space Weather plays an increasingly important role in our technology-dependent society. In particular, it focuses on events that can affect the performance and reliability of space-borne and ground-based technological systems, such as satellites and electric networks that can be damaged by an enhanced flux of energetic particles interacting with electronic circuits. (According to a recent survey conducted by the insurance company Lloyd's, an extreme Space Weather event could cause up to \$2.6 trillion in financial damage.)

Since 2016, in the context of the Inria-CWI partnership, a collaboration between Tau and the Multiscale Dynamics Group of CWI has aimed at long-term Space Weather forecasting. The goal is to take advantage of the data produced every day by satellites surveying the Sun and the magnetosphere, and more particularly to relate solar images to the quantities (e.g., electron flux, proton flux, solar wind speed) measured at the L1 libration point between the Earth and the Sun (about 1,500,000 km from Earth, roughly 1 hour upstream of it). A challenge is to formulate such goals as supervised learning problems, since the "labels" associated with solar images are recorded at L1 with a varying and unknown time lag. In essence, while typical ML models aim to answer the question What, our goal here is to answer both What and When. This project has been articulated around Mandar Chandorkar's PhD thesis [11], defended this year in Eindhoven. One of the main results concerns the prediction of the solar wind impacting the Earth's magnetosphere from solar images. In this context we encountered an interesting sub-problem related to the non-deterministic travel time of a solar eruption to the Earth's magnetosphere. We formalized it as the joint regression task of predicting both the magnitude of signals and their time delay with respect to the driving phenomena. In [28] we proposed an approach to this problem combining deep learning with an original Bayesian forward attention mechanism. A theoretical analysis based on linear stability was conducted to put this algorithm on firm ground. From the practical point of view, encouraging tests have been performed on both synthetic and real data, with results slightly better than those reported in the specialized literature on a small dataset. Various extensions of the method, the experimental tests and the theoretical analysis are planned.
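The "what and when" problem can be sketched on synthetic data (a brute-force lag search stands in for the attention mechanism of [28]; the signals, lag and gain are hypothetical): the model must jointly recover the unknown delay and the magnitude relation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Driver signal (e.g. a solar-image summary statistic) and a delayed,
# scaled response (e.g. solar wind speed at L1); lag and gain are unknown.
T, true_lag, true_gain = 500, 7, 2.5
x = rng.normal(size=T)
y = np.empty(T)
y[true_lag:] = true_gain * x[:-true_lag]
y[:true_lag] = rng.normal(size=true_lag)
y += 0.1 * rng.normal(size=T)

def fit_lag_and_gain(x, y, max_lag=20):
    """Joint estimate: pick the lag with the best fit, then regress the
    magnitude at that lag (toy stand-in for learned forward attention)."""
    best = None
    for lag in range(1, max_lag):
        xs, ys = x[:-lag], y[lag:]
        gain = (xs @ ys) / (xs @ xs)
        resid = np.mean((ys - gain * xs) ** 2)
        if best is None or resid < best[2]:
            best = (lag, gain, resid)
    return best[0], best[1]

lag_hat, gain_hat = fit_lag_and_gain(x, y)
print(lag_hat, round(gain_hat, 2))   # recovers the hidden delay and gain
```

In the real problem the delay itself depends on the driving event, which is why a learned, input-dependent attention over candidate delays is needed rather than a single global lag.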

#### Genomic Data and Population Genetics

Participants: Guillaume Charpiat, Flora Jay, Aurélien Decelle, Cyril Furtlehner

PhD: Théophile Sanchez – PostDoc: Jean Cury

Collaboration: Bioinfo Team (LRI), Estonian Biocentre (Institute of Genomics, Tartu, Estonia), Pasteur Institute (Paris), TIMC-IMAG (Grenoble)

Thanks to the constant improvement of DNA sequencing technology, large quantities of genetic data are becoming available and should greatly enhance our knowledge about evolution, and in particular the past history of populations. This history can be reconstructed over the past thousands of years by inference from present-day individuals: by comparing their DNA and identifying shared genetic mutations or motifs, their frequency, and their correlations at different genomic scales. Still, the best way to extract information from large genomic data remains an open problem; currently, it mostly relies on drastic dimensionality reduction, considering a few well-studied population genetics features.

We developed an approach that extracts features from genomic data using deep neural networks and combines them with a Bayesian framework to approximate the posterior distribution of demographic parameters. The key difficulty is to build flexible, problem-dependent architectures, supporting transfer learning and in particular handling data of variable size. We designed new generic architectures that take DNA specificities into account for the joint analysis of a group of individuals, including the variable size of the data, and compared their performance to state-of-the-art approaches [148]. In the short term these architectures can be used for demographic inference or selection inference in bacterial populations (ongoing work with a postdoctoral researcher, J. Cury, and the Pasteur Institute); the longer-term goal is to integrate them into various systems handling genetic data or other biological sequence data.
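The variable-size requirement can be sketched with a minimal exchangeable encoder (layer sizes and weights are hypothetical, not those of [148]): a shared per-individual embedding followed by mean pooling makes the output invariant to the order of individuals and defined for any sample size.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical layer sizes: 8 genomic sites in, 4 summary features out.
W1 = rng.normal(size=(8, 16))   # per-individual embedding weights
W2 = rng.normal(size=(16, 4))   # post-pooling head

def encode(genotypes):
    """genotypes: (n_individuals, n_sites) 0/1 matrix, any n_individuals."""
    h = np.maximum(genotypes @ W1, 0.0)   # shared per-row layer (ReLU)
    return h.mean(axis=0) @ W2            # permutation-invariant pooling

g = rng.integers(0, 2, size=(30, 8)).astype(float)
perm = rng.permutation(30)

print(np.allclose(encode(g), encode(g[perm])))  # order-invariant: True
print(encode(g[:12]).shape)                     # smaller sample still works
```

Mean pooling is the simplest choice; the architectures of [148] combine such exchangeable layers with components tailored to the sequential structure of DNA.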

In collaboration with the Institute of Genomics of Tartu (Estonia; B. Yelmen, 3-month visitor at LRI), we leveraged two types of generative neural networks (Generative Adversarial Networks and Restricted Boltzmann Machines) to learn the high-dimensional distributions of real genomic datasets and create artificial genomes [66]. These artificial genomes retain important characteristics of the real genomes (allele frequencies and linkage, hidden population structure, ...) without copying them, and have the potential to be valuable assets in future genetic studies by providing anonymous substitutes for private databases (such as the ones held by companies or public institutes like the Institute of Genomics of Tartu). Yet ensuring anonymity is a challenging point, and we measured the privacy loss by using and extending the Adversarial Accuracy score developed by the team for synthetic medical data [44].
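The idea behind the Adversarial Accuracy score can be sketched with a nearest-neighbour version (in the spirit of [44]; the exact definition there may differ, and the data here are synthetic Gaussians rather than genomes): a score near 0.5 means the synthetic data are neither distinguishable from the real data nor copies of them.

```python
import numpy as np

rng = np.random.default_rng(5)

def nn_adversarial_accuracy(real, synth):
    """~0.5: synthetic data indistinguishable and not copied;
    ~1: distinguishable; ~0: memorized/copied (privacy leak)."""
    def nn_dist(A, B, exclude_self=False):
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        if exclude_self:
            np.fill_diagonal(d, np.inf)   # ignore the point itself
        return d.min(axis=1)
    real_to_synth = nn_dist(real, synth)
    real_to_real = nn_dist(real, real, exclude_self=True)
    synth_to_real = nn_dist(synth, real)
    synth_to_synth = nn_dist(synth, synth, exclude_self=True)
    return 0.5 * (np.mean(real_to_synth > real_to_real)
                  + np.mean(synth_to_real > synth_to_synth))

real = rng.normal(size=(200, 5))
good_synth = rng.normal(size=(200, 5))               # same distribution
copied = real + 1e-6 * rng.normal(size=real.shape)   # near-exact copies

print(round(nn_adversarial_accuracy(real, good_synth), 2))  # close to 0.5
print(nn_adversarial_accuracy(real, copied) < 0.1)          # leak detected
```

Scores far below 0.5 flag generators that memorize training individuals, which is precisely the privacy failure mode to avoid with genomic data.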

In collaboration with TIMC-IMAG, we proposed a new factor analysis approach that processes genetic data of multiple individuals from present-day and ancient populations to visualize population structure and estimate admixture coefficients (that is, the probability that an individual belongs to each of several groups given the genetic data). This method corrects the traditionally used PCA by accounting for time heterogeneity, and enables a more accurate dimension reduction of paleogenomic data [59].

#### Sampling molecular conformations

Participants: Guillaume Charpiat

PhD: Loris Felardos

Collaboration: Jérôme Hénin (IBPC), Bruno Raffin (InriAlpes)

Numerical simulations on massively parallel architectures, routinely used to study the dynamics of biomolecules at the atomic scale, produce large amounts of data representing the time trajectories of molecular configurations, with the goal of exploring and sampling all possible configuration basins of given molecules. The configuration space is high-dimensional (10,000+), hindering the use of standard data analytics approaches. The use of advanced data analytics to identify intrinsic configuration patterns could be transformative for the field.

The high-dimensional data produced by molecular simulations live on low-dimensional manifolds; extracting these manifolds will make it possible to drive detailed large-scale simulations further into the configuration space. This year, we studied how to bypass simulations by directly predicting, given a molecule formula, its possible configurations. This is done with Graph Neural Networks [89] used generatively, producing 3D configurations. The goal is to sample all possible configurations, with the right probability.
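The core building block can be sketched as one message-passing round of a toy graph network (the adjacency, feature and weight sizes are hypothetical, and the actual generative model in this work is far richer): each atom updates its state from its neighbours' states, and a readout maps states to 3D coordinates.

```python
import numpy as np

rng = np.random.default_rng(9)

# A 4-atom chain molecule, encoded by its adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 6))                  # initial atom features
W_self = rng.normal(size=(6, 6))             # self-update weights
W_nbr = rng.normal(size=(6, 6))              # neighbour-message weights
W_out = rng.normal(size=(6, 3))              # readout to 3-D coordinates

deg = A.sum(axis=1, keepdims=True)
H1 = np.tanh(H @ W_self + (A @ H) / deg @ W_nbr)   # one message-passing step
coords = H1 @ W_out                                # predicted 3-D positions
print(coords.shape)   # (4, 3): one position per atom
```

Stacking such rounds, and sampling latent noise before the readout, turns this into a generative model over conformations.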

#### Storm trajectory prediction

Participants: Guillaume Charpiat

Collaboration: Sophie Giffard-Roisin (IRD), Claire Monteleoni (Boulder University), Balazs Kegl (LAL)

Cyclones, hurricanes and typhoons all designate the same rare and complex event, characterized by strong winds surrounding a low-pressure area. Forecasting their trajectory and intensity, crucial for the protection of people and goods, depends on many factors at different scales and altitudes. Additionally, storms have been more numerous since the 1990s, leading to both more representative and more consistent error statistics.

Currently, track and intensity forecasts are provided by numerous guidance models. Dynamical models solve the physical equations governing motions in the atmosphere. While they can provide precise results, they are computationally demanding. Statistical models are based on historical relationships between storm behavior and other parameters [82]. Current national forecasts are typically driven by consensus methods able to combine different dynamical models.

Statistical models perform poorly compared to dynamical models, although they rely on steadily increasing data resources. ML methods have scarcely been considered, despite their successes in related forecasting problems [169]. A main difficulty is to exploit spatio-temporal patterns. Another difficulty is to select and merge data coming from heterogeneous sensors. For instance, temperature and pressure are real values on a 3D spatial grid, while sea surface temperature or land indication rely on a 2D grid, wind is a 2D vector field, while many indicators such as geographical location (ocean, hemisphere...) are just real values (not fields), and displacement history is a 1D vector (time). An underlying question regards the innate vs acquired issue, and how to best combine physical models with trained models. The continuation of the work started last year [101] shows that with deep learning one can outperform the state-of-the-art in many cases [18].
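One way to handle the heterogeneous sensors listed above is a multi-stream architecture: one encoder per modality, with the feature vectors concatenated before a shared prediction head. The sketch below is a hypothetical toy version (placeholder encoders and sizes, random weights, no training):

```python
import numpy as np

rng = np.random.default_rng(6)

# Placeholder per-modality encoders (stand-ins for learned CNN/MLP streams).
def encode_grid3d(u):   # e.g. temperature/pressure on a 3-D grid
    return u.reshape(-1)[:32]
def encode_grid2d(u):   # e.g. sea-surface temperature map
    return u.reshape(-1)[:16]
def encode_track(u):    # 1-D displacement history
    return u
def encode_scalars(u):  # location indicators, hemisphere, etc.
    return u

W = rng.normal(size=(32 + 16 + 8 + 3, 2))   # shared head: lat/lon shift

def forecast(grid3d, grid2d, track, scalars):
    feats = np.concatenate([encode_grid3d(grid3d), encode_grid2d(grid2d),
                            encode_track(track), encode_scalars(scalars)])
    return feats @ W

out = forecast(rng.normal(size=(4, 8, 8)), rng.normal(size=(8, 8)),
               rng.normal(size=8), rng.normal(size=3))
print(out.shape)   # (2,): predicted displacement
```

Fusing at the feature level lets each modality keep an encoder suited to its geometry (3D grid, 2D map, time series, scalars) while the head learns their joint effect on the track.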