As the constant surge of computational power enables scientists to simulate the most detailed features of reality, from complex molecular systems to climate and weather forecasting, the computer simulation of physical systems is becoming reliant on highly complex stochastic dynamical models and very abundant observational data. The complexity of such models and of the associated observational data stems from intrinsic physical features, including high dimensionality as well as intricate temporal and spatial multi-scale structure. It also results in much less control over simulation uncertainty.

Within this highly challenging context, SIMSMART positions itself as a mathematical and computational probability and statistics research team, dedicated to *Monte Carlo simulation* methods. Such methods include in particular particle Monte Carlo methods for rare event simulation, data assimilation and model reduction, with application to stochastic dynamical physical models. The main objective of SIMSMART is to disrupt this now classical field by creating deeper mathematical frameworks adapted to the management of contemporary, highly sophisticated physical models.

**Introduction.** Computer simulation of physical systems is becoming increasingly reliant on highly complex models, as the constant surge of computational power enables scientists to simulate the most detailed features of reality – from complex molecular systems to climate and weather forecasting.

Yet, when modeling physical reality, bottom-up approaches stumble over intrinsic difficulties. First, the timescale separation between the fastest simulated microscopic features and the macroscopic effective slow behavior becomes huge, implying that the fully detailed, direct long-time simulation of many interesting systems (*e.g.* large molecular systems) is out of reasonable computational reach. Second, the chaotic dynamical behavior of the systems at stake, coupled with such multi-scale structure, exacerbates the uncertainty of outcomes, which become highly dependent on intrinsic chaos, uncontrolled modeling choices, and numerical discretization. Finally, the massive increase of observational data poses new challenges to classical data assimilation, such as dealing with high-dimensional observations and/or extremely long time series of observations.

**SIMSMART Identity.** Within this highly challenging applicative context, SIMSMART positions itself as a computational probability and statistics research team, with a mathematical perspective. Our approach is based on the *stochastic modeling* of complex physical systems, and on *Monte Carlo simulation* methods, with a strong emphasis on dynamical models. The two main numerical tasks of interest to SIMSMART are the following: (i) simulating with pseudo-random number generators – a.k.a. *sampling* – dynamical models of random physical systems; (ii) sampling such random physical dynamical models given some real observations – a.k.a. *Bayesian data assimilation*. SIMSMART aims at providing an appropriate mathematical level of abstraction and generalization to a wide variety of Monte Carlo simulation algorithms in order to propose non-superficial answers to both *methodological and mathematical* challenges. The issues to be resolved include computational complexity reduction, statistical variance reduction, and uncertainty quantification.

**SIMSMART's Objectives.** The main objective of SIMSMART is to disrupt this now classical field of particle Monte Carlo simulation by creating deeper mathematical frameworks adapted to the challenging world of complex (*e.g.* high dimensional and/or multi-scale), and massively observed systems, as described in the beginning of this introduction.

To be more specific, we will classify SIMSMART objectives using the following four intertwined topics:

Objective 1: Rare events and random simulation.

Objective 2: High dimensional and advanced particle filtering.

Objective 3: Non-parametric approaches.

Objective 4: Model reduction and sparsity.

Rare events (Objective 1) are ubiquitous in random simulation, either to accelerate the occurrence of physically relevant slow random phenomena, or to estimate the effect of uncertain variables. Objective 1 will mainly be concerned with particle methods where *splitting* is used to enforce the occurrence of rare events.

The problem of high-dimensional observations, the main topic of Objective 2, is a known bottleneck in filtering, especially in non-linear particle filtering, where linear data assimilation methods remain the state-of-the-art approach.

The increasing size of recorded observational data and the increasing complexity of models also suggest devoting more effort to non-parametric data assimilation methods, the main issue of Objective 3.

In some contexts, for instance when one wants to compare solutions of a complex (*e.g.* high-dimensional) dynamical system depending on uncertain parameters, the construction of relevant reduced-order models becomes a key topic. This is the content of Objective 4.

With respect to volume of research activity, Objective 1, Objective 4 and the sum (Objective 2 + Objective 3) are of comparable size.

Some new challenges in the simulation and data assimilation of random physical dynamical systems have become prominent in the last decade. A first issue (i) consists in the intertwined problems of simulating over large, macroscopic random times, and simulating *rare events*. The link between both aspects stems from the fact that many effective, large-time dynamics can be approximated by sequences of rare events. A second, obvious, issue (ii) consists in managing *very abundant observational data*. A third issue (iii) consists in quantifying the *uncertainty/sensitivity/variance* of outcomes with respect to models or noise. A fourth issue (iv) consists in managing *high dimensionality*, either when dealing with complex prior physical models, or with very large data sets. The related increase of complexity also requires, as a fifth issue (v), the construction of *reduced models* to speed up comparative simulations. In a context of very abundant data, this may be replaced by a sixth issue (vi), where complexity constraints on modeling are replaced by the use of *non-parametric statistical inference*.

Hindsight suggests that all these challenges are related. Indeed, the contemporary digital condition, characterized by a massive increase in computational power and in available data, results in a demand for more complex and uncertain models, for more extreme regimes, and for inductive approaches relying on abundant data.

For simplicity, we have classified SIMSMART research into the four main objectives already mentioned above.

Objective 1: Rare events and random simulation, which mainly encompasses item (i).

Objective 2: High dimension and advanced particle filtering, which encompasses item (iv).

Objective 3: Non-parametric inference, which mainly encompasses items (ii) and (vi).

Objective 4: Model reduction, which mainly encompasses item (v).

Uncertainty quantification (item (iii)) in fact underlies each aspect since we are mainly interested in Monte Carlo approaches, so that uncertainty can be *modeled by an initial random variable and be incorporated in the state space of the physical model*.

The development of large-scale computing facilities has enabled simulations of systems at the *atomistic scale* on a daily basis. The aim of these simulations is to bridge the time and space scales between the macroscopic properties of matter and the stochastic atomistic description. Typically, such simulations are based on the ordinary differential equations of classical mechanics supplemented with a random perturbation modeling temperature, or collisions between particles.

Let us give a few examples. In bio-chemistry, such simulations are key to predicting the influence of a ligand on the behavior of a protein, with applications to drug design. The computer can thus be used as a *numerical microscope* to access data that would be very difficult and costly to obtain experimentally. In that case, a rare event (Objective 1) is given by a macroscopic system change, such as a conformation change of the protein. In nuclear safety, such simulations are key to predicting the transport of neutrons in nuclear plants, with application to assessing the aging of concrete. In that case, a rare event is given by a high-energy neutron impacting concrete containment structures.

A typical model used in molecular dynamics simulation of open systems at a given temperature is a stochastic differential equation of Langevin type. The large-time behavior of such systems is typically characterized by a hopping dynamics between 'metastable' configurations, usually defined by local minima of a potential energy. In order to bridge the time and space scales between the atomistic level and the macroscopic level, specific algorithms enforcing the realization of rare events have been developed. For instance, splitting particle methods (Objective 1) have become popular within the computational physics community only in the last few years, partially as a consequence of interactions between physicists and Inria mathematicians in the ASPI (parent of SIMSMART) and MATHERIALS project-teams.
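As an illustration of this metastable behavior, here is a minimal sketch (not the team's production code) of an overdamped Langevin equation dX = -V'(X) dt + sqrt(2/β) dW in the double-well potential V(x) = (x² - 1)², discretized with the Euler-Maruyama scheme. The potential, parameter values and function names are illustrative choices:

```python
import numpy as np

def simulate_overdamped_langevin(beta=2.0, dt=1e-3, n_steps=200_000, seed=0):
    """Euler-Maruyama discretization of dX = -V'(X) dt + sqrt(2/beta) dW
    for the (illustrative) double-well potential V(x) = (x^2 - 1)^2."""
    rng = np.random.default_rng(seed)
    grad_V = lambda x: 4.0 * x * (x**2 - 1.0)    # V'(x)
    x = np.empty(n_steps)
    x[0] = -1.0                                   # start in the left well
    noise = np.sqrt(2.0 * dt / beta) * rng.standard_normal(n_steps - 1)
    for i in range(n_steps - 1):
        x[i + 1] = x[i] - grad_V(x[i]) * dt + noise[i]
    return x

traj = simulate_overdamped_langevin()
# fraction of time spent near each metastable configuration
print("time near x=-1:", np.mean(traj < 0), "time near x=+1:", np.mean(traj > 0))
```

At this moderate temperature the trajectory hops between the wells at x = ±1 on a time scale far longer than the discretization step, which is precisely the multi-timescale bottleneck that rare-event algorithms address.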

The traditional trend in data assimilation in geophysical sciences (climate, meteorology) is to use as prior information some very complex deterministic models, formulated in terms of fluid dynamics and reflecting as much as possible the underlying physical phenomenon.

The main issue is therefore to perform such Bayesian estimation with very expensive, infinite-dimensional prior models and high-dimensional observations. The use of linearity assumptions in prior models (Kalman filtering) to filter non-linear hydrodynamical phenomena is the state-of-the-art approach, and a current field of research, but is plagued with intractable instabilities.

This context motivates two research trends: (i) the introduction of non-parametric, model-free prior dynamics constructed from a large amount of past, recorded real weather data; and (ii) the development of appropriate non-linear filtering approaches (Objective 2 and Objective 3).
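As a minimal illustration of non-linear filtering, the following sketch implements a bootstrap (sampling-importance-resampling) particle filter on a toy one-dimensional state-space model. The model and all parameter values are illustrative assumptions, not the geophysical systems discussed above:

```python
import numpy as np

def bootstrap_particle_filter(obs, n_particles=500, sigma_x=0.5, sigma_y=0.5, seed=0):
    """Bootstrap particle filter for the toy nonlinear model
       X_t = 0.9 sin(X_{t-1}) + sigma_x W_t,   Y_t = X_t + sigma_y V_t."""
    rng = np.random.default_rng(seed)
    particles = rng.standard_normal(n_particles)      # draw from the prior on X_0
    estimates = []
    for y in obs:
        # propagate particles through the (nonlinear) prior dynamics
        particles = 0.9 * np.sin(particles) + sigma_x * rng.standard_normal(n_particles)
        # weight by the Gaussian observation likelihood
        log_w = -0.5 * ((y - particles) / sigma_y) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(np.sum(w * particles))       # filtering mean at time t
        # multinomial resampling
        particles = rng.choice(particles, size=n_particles, p=w)
    return np.array(estimates)

# synthetic experiment (illustrative): hidden truth plus noisy observations
rng = np.random.default_rng(1)
truth = np.zeros(100)
for t in range(1, 100):
    truth[t] = 0.9 * np.sin(truth[t - 1]) + 0.5 * rng.standard_normal()
obs = truth + 0.5 * rng.standard_normal(100)
est = bootstrap_particle_filter(obs)
```

In high dimension this plain bootstrap scheme degenerates (weight collapse), which is exactly the bottleneck motivating the advanced particle filtering of Objective 2.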

SIMSMART will also test its new methods on multi-source data collected in the North Atlantic, paying particular attention to coastal areas (*e.g.* within the inter-Labex SEACS project).

Adaptive Multilevel Splitting (AMS for short) is a generic Monte Carlo method for Markov processes that simulates rare events and estimates the associated probabilities. Despite its practical efficiency, there were almost no theoretical results on the convergence of this algorithm. In recent work, we prove both consistency and asymptotic normality results in a general setting. This is done by associating with the original Markov process a level-indexed process, also called a stochastic wave, and by showing that AMS can then be seen as a Fleming-Viot type particle system. This being done, we can apply general results on Fleming-Viot particle systems that we recently obtained. In further work, we extend the central limit theorem to the case of synchronized branchings, where re-sampling of particles is performed after any given number of particles have been killed. The result is obtained in the generic case of Fleming-Viot particle systems.
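To fix ideas, here is a toy sketch of AMS (killing all particles at the current minimal level at each iteration) for a discretized Brownian motion started at 0, estimating the probability of hitting b before -a. For this example the exact answer is approximately a/(a+b), which gives a sanity check; all parameter values are illustrative:

```python
import numpy as np

def simulate_path(x0, a, b, dt, rng):
    """Simulate a discretized Brownian path from x0 until it exits (-a, b)."""
    path = [x0]
    x = x0
    while -a < x < b:
        x += np.sqrt(dt) * rng.standard_normal()
        path.append(x)
    return np.array(path)

def ams(n_particles=100, a=0.1, b=1.0, dt=1e-3, seed=0):
    """Toy adaptive multilevel splitting: estimate P(hit b before -a) for a
    Brownian motion started at 0 (exact answer roughly a / (a + b))."""
    rng = np.random.default_rng(seed)
    paths = [simulate_path(0.0, a, b, dt, rng) for _ in range(n_particles)]
    maxima = np.array([p.max() for p in paths])   # score = highest level reached
    p_hat = 1.0
    while maxima.min() < b:
        level = maxima.min()
        killed = np.flatnonzero(maxima <= level)
        survivors = np.flatnonzero(maxima > level)
        if survivors.size == 0:                   # degenerate case: all particles tie
            return 0.0
        p_hat *= 1.0 - killed.size / n_particles  # level-passage factor
        for i in killed:
            # clone a random survivor up to its first crossing of the killed
            # level, then resimulate the remainder of the trajectory
            donor = paths[survivors[rng.integers(survivors.size)]]
            cross = int(np.argmax(donor > level))
            prefix = donor[:cross + 1]
            tail = simulate_path(prefix[-1], a, b, dt, rng)
            paths[i] = np.concatenate([prefix, tail[1:]])
            maxima[i] = paths[i].max()
    return p_hat

p_hat = ams()
print("AMS estimate:", p_hat, "reference a/(a+b):", 0.1 / 1.1)
```

The estimator is the product of the survival fractions over the successive levels, the key mechanism behind splitting methods.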

Probability measures supported on submanifolds can be sampled by adding an extra momentum variable to the state of the system, and discretizing the associated Hamiltonian dynamics with some stochastic perturbation in the extra variable. In order to avoid biases in the invariant probability measures sampled by discretizations of these stochastically perturbed Hamiltonian dynamics, a Metropolis rejection procedure can be considered. The resulting scheme belongs to the class of generalized Hybrid Monte Carlo (GHMC) algorithms. In recent work, we show how to generalize to GHMC a procedure suggested by Goodman, Holmes-Cerfon and Zappa for Metropolis random walks on submanifolds, where a reverse projection check is performed to enforce the reversibility of the algorithm for large timesteps and hence avoid biases in the invariant measure. We also provide a full mathematical analysis of such procedures, as well as numerical experiments demonstrating the importance of the reverse projection check on simple toy examples.
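The reverse projection check is easiest to see on a toy example: a Metropolis random walk sampling the uniform measure on the unit circle, in the spirit of the Goodman, Holmes-Cerfon and Zappa scheme. A tangential Gaussian proposal is projected back onto the manifold by a Newton solve along the constraint gradient; a proposal is rejected whenever the forward projection fails or the reverse projection does not map back to the current state. This sketch (one-dimensional constraint only, uniform target) is illustrative, not the GHMC generalization of the paper:

```python
import numpy as np

def project(y, q, tol=1e-12, max_iter=50):
    """Newton solve for a in G(y + a*q) = 0 with G(x) = |x|^2 - 1,
    projecting along the constraint-gradient direction at q."""
    a = 0.0
    for _ in range(max_iter):
        x = y + a * q
        g = x @ x - 1.0
        if abs(g) < tol:
            return a, True
        dg = 2.0 * (x @ q)
        if abs(dg) < 1e-14:
            return a, False          # Newton breakdown: treat as failure
        a -= g / dg
    return a, False                  # no convergence: projection failed

def mh_on_circle(n_samples=2000, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    q = np.array([1.0, 0.0])
    samples = []
    for _ in range(n_samples):
        xi = sigma * rng.standard_normal(2)
        v = xi - (xi @ q) * q                   # tangential proposal at q
        a, ok = project(q + v, q)
        if ok:
            q_new = q + v + a * q
            # reverse move and reverse projection check
            d = q - q_new
            v_rev = d - (d @ q_new) * q_new     # tangential part at q_new
            a_rev, ok_rev = project(q_new + v_rev, q_new)
            if ok_rev and np.allclose(q_new + v_rev + a_rev * q_new, q, atol=1e-8):
                # Metropolis ratio for the uniform measure on the circle
                log_ratio = (v @ v - v_rev @ v_rev) / (2.0 * sigma**2)
                if np.log(rng.uniform()) < log_ratio:
                    q = q_new
        samples.append(q.copy())
    return np.array(samples)

chain = mh_on_circle()
```

For large stepsizes the forward or reverse projection fails more often and the corresponding proposals are rejected, which is what restores reversibility.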

Production forecast errors are the main hurdle to integrating variable renewable energies into electrical power systems. Regardless of the technique, these errors are inherent in the forecasting exercise, although their magnitude varies significantly depending on the method and the horizon. As power systems have to balance out these errors, their dynamic and stochastic modeling is valuable for real-time operation. A recent study proposes a Markov Switching Auto Regressive (MS-AR) approach. After having validated its statistical relevance, this model is used to solve the problem of the optimal management of a storage unit associated with a wind power plant when this virtual power plant must respect a production commitment.
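To illustrate the model class (not the fitted model of the study), the following sketch simulates a two-regime MS-AR(1) process: a hidden Markov chain selects the regime, and each regime has its own autoregressive coefficient and innovation variance. All numerical values are illustrative:

```python
import numpy as np

def simulate_ms_ar1(n=500, seed=0):
    """Simulate a 2-regime Markov-Switching AR(1):
    S_t follows a 2-state Markov chain, X_t = phi[S_t]*X_{t-1} + sigma[S_t]*eps_t."""
    rng = np.random.default_rng(seed)
    P = np.array([[0.95, 0.05],        # regime transition matrix (illustrative)
                  [0.10, 0.90]])
    phi = np.array([0.9, 0.3])         # AR coefficient per regime
    sigma = np.array([0.1, 0.5])       # innovation std per regime
    s = np.zeros(n, dtype=int)
    x = np.zeros(n)
    for t in range(1, n):
        s[t] = rng.choice(2, p=P[s[t - 1]])
        x[t] = phi[s[t]] * x[t - 1] + sigma[s[t]] * rng.standard_normal()
    return s, x

states, series = simulate_ms_ar1()
```

The persistent regimes reproduce the alternation between calm and turbulent forecast-error periods; in practice the parameters are estimated by an EM-type algorithm rather than fixed by hand.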

Model reduction aims at proposing efficient algorithmic procedures for the resolution (to some reasonable accuracy) of high-dimensional systems of parametric equations. This overall objective entails several subtasks:

1) the identification of low-dimensional surrogates of the target "solution" manifold;

2) the design of efficient resolution methodologies exploiting these low-dimensional surrogates;

3) the theoretical validation of the accuracy achievable by the proposed procedures.

This year, we made several contributions to these subtasks. In most of our contributions, we deviated from the standard working hypothesis involving a linear subspace surrogate.

In a first group of publications, we concentrated on the so-called "sparse" low-dimensional model. In this context, we proposed several new algorithmic solutions to decrease the computational complexity associated with projection onto this low-dimensional model. These methodologies fit within the context of "screening" procedures for the LASSO. We first introduced a new screening strategy, dubbed the "joint screening test", which allows a whole set of atoms to be rejected by performing one single test. Our approach makes it possible to find good compromises between implementation complexity and screening effectiveness. Second, we proposed two new methods to decrease the computational cost inherent to the construction of the so-called "safe region". Our numerical experiments show that the proposed procedures lead to significant computational gains compared to standard methodologies. Finally, we showed in another work that the main concepts underlying screening procedures can be extended to different families of convex optimization problems.
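For readers unfamiliar with screening, the idea can be illustrated with the classic static SAFE rule of El Ghaoui et al. (a simpler ancestor of the joint screening test above): a cheap test that certifies, before solving the LASSO, that some atoms have a zero coefficient at the optimum. The data below are synthetic and illustrative:

```python
import numpy as np

def safe_screening(A, y, lam):
    """Classic static SAFE rule (El Ghaoui et al.) for the LASSO
         min_w 0.5*||y - A w||^2 + lam*||w||_1 :
    atom j is certified inactive (w_j* = 0) whenever
         |a_j' y| < lam - ||a_j|| * ||y|| * (lam_max - lam) / lam_max."""
    corr = np.abs(A.T @ y)
    lam_max = corr.max()
    thresh = lam - np.linalg.norm(A, axis=0) * np.linalg.norm(y) \
                 * (lam_max - lam) / lam_max
    return corr < thresh            # boolean mask of atoms that can be discarded

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))
A /= np.linalg.norm(A, axis=0)      # unit-norm atoms
y = A[:, 3] + 0.01 * rng.standard_normal(50)
screened = safe_screening(A, y, lam=0.8 * np.abs(A.T @ y).max())
print("atoms discarded before solving:", screened.sum(), "of", A.shape[1])
```

Screened atoms can simply be removed from the dictionary before running any LASSO solver, which is where the computational gain comes from; the joint and "safe-region" tests of our publications refine this basic mechanism.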

Another avenue of research has been the study of the sparse surrogate in the context of "continuous" dictionaries, where the elementary signals forming the decomposition catalog are functions of some parameters taking their values in a continuous domain. In this context, we contributed to the theoretical characterization of the performance of a well-known algorithmic procedure, namely "orthogonal matching pursuit" (OMP). More specifically, we proposed the first theoretical analysis of the behavior of OMP in the continuous setup. We also provided a new connection between two popular low-rank approximations of continuous dictionaries, namely the "polar" and "SVD" approximations.
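For reference, here is a minimal implementation of OMP in the standard discrete-dictionary setting; in the continuous setting analyzed in our work, the argmax over atoms becomes an optimization over the dictionary parameter. The synthetic problem is illustrative:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily select k atoms of A,
    re-fitting the coefficients by least squares at each step."""
    residual = y.copy()
    support = []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))   # most correlated atom
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    return np.array(support), coef

# synthetic 2-sparse problem over a random unit-norm dictionary (illustrative)
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
A /= np.linalg.norm(A, axis=0)
x_true = np.zeros(100)
x_true[[7, 23]] = [1.5, -2.0]
y = A @ x_true
support, coef = omp(A, y, k=2)
```

Because the coefficients are re-fitted by least squares, the residual is orthogonal to the selected atoms, so each iteration picks a new atom and the residual norm is non-increasing.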

The tools developed in the field of model-order reduction and sparsity have found applications in geophysics and fluid mechanics. In several publications, we derived procedures based on sparse representations to localize the positions of particles in a moving fluid, and we designed learning methodologies to infer the dynamical model underlying a set of observed data.

**Scalian Alyotech**, through the CIFRE PhD project of Gabriel Jouan, dedicated to weather forecast corrections.

**Naval Group Research**, through the CIFRE PhD project of Audrey Cuillery dedicated to Bayesian tracking.

**Eau du Ponant**, through the R

**Cooper Standard**, Machine Learning for joints design.

**EURAMED**, a Euro-Mediterranean Cooperation Initiative, which aims to develop an Internet-based, multi-parametric electronic platform for the optimal design of desalination plants supplied by Renewable Energy Sources (RES). PI: E. Koutroulis (Greece).

**Inter-Labex SEACS:** V. Monbet, F. Le Gland, C. Herzet and Thi Tuyet Trang Chau (PhD student) are part of the *inter-Labex Cominlabs-Lebesgue-Mer SEACS* project (http://www.seacs.cominlabs.ueb.eu/fr), which stands for Stochastic modEl-dAta-Coupled representationS for the analysis, simulation and reconstruction of upper ocean dynamics. This project, which mainly concerns Objectives 2 and 3, aims at exploring novel statistical and stochastic methods to address the emulation, reconstruction and forecast of fine-scale upper ocean dynamics.

**CMEMS 3DA (2018-2019):** C. Herzet is part of the project *CMEMS 3DA* on the assimilation of oceanographic events with non-parametric data assimilation methods. The goal of the project is to demonstrate the relevance of data-driven strategies to improve satellite-derived interpolated products, especially the geostrophic surface currents. The project is carried out in collaboration with IMT Atlantique Brest, Ifremer and the Institute of Geosciences and Environment in Grenoble.

**Action Exploratoire – Labex Cominlabs:** C. Herzet is part of a project on sparse representations in continuous dictionaries. Partners: R. Gribonval (Inria Rennes PANAMA), A. Drémeau (IMT Atlantique) and P. Tandeo (IMT Atlantique).

**ANR BECOSE (2016-2020):** Beyond Compressive Sensing: Sparse approximation algorithms for ill-conditioned inverse problems.

Cédric Herzet is part of the BECOSE project, which aims to extend the scope of sparsity techniques much beyond the academic setting of random and well-conditioned dictionaries. In particular, one goal of the project is to step back from the popular L1-convexification of the sparse representation problem and to consider more involved nonconvex formulations, both from a methodological and a theoretical point of view. The algorithms will be assessed in the context of tomographic Particle Image Velocimetry (PIV), a rapidly growing imaging technique in fluid mechanics with strong impact in several industrial sectors, including the environment, automotive and aeronautical industries.

**ANR Melody (2020-2024):** Bridging geophysics and MachinE Learning for the modeling, simulation and reconstruction of Ocean DYnamics.

Cédric Herzet is part of the MELODY project. The MELODY project aims to bridge the physical model‐driven paradigm underlying ocean/atmosphere science and AI paradigms with a view to developing geophysically‐sound learning‐based and data‐driven representations of geophysical flows accounting for their key features (e.g., chaos, extremes, high‐dimensionality).

**ERC MSMath (2015-2019):** M. Rousset is part of the *ERC MSMath* project on molecular simulation (PI: T. Lelièvre). With the development of large-scale computing facilities, simulations of materials at the molecular scale are now performed on a daily basis. The objective of the MSMath ERC project is to develop and study efficient algorithms to simulate such high-dimensional systems over very long, macroscopic times. The project especially focuses on the computational issues related to 'metastable' states, that is, specific molecular configurations that evolve only over very long time scales. This results in a multi-timescale computational bottleneck that needs to be addressed by specific algorithms.

**European Organization for the Exploitation of Meteorological Satellites (EUMETSAT)**, Darmstadt. The transfer focuses on the estimation of atmospheric 3D winds from future hyperspectral instruments (IRS on MTG-S, developed by ESA, and IASI-NG on Metop-SG, developed by CNES).

**ECOS ARGENTINE (2018-2021):** V. Monbet has obtained funding through the ECOS Sud - MINCyT initiative.

Cédric Herzet is part of the organizing committee of the iTwist’20 Workshop.

Cédric Herzet has given:

INSA Rennes, 5th-year Génie Mathématique option, course on sparsity in signal and image processing, 10h of lectures + module coordinator

Ensai Rennes, international Master « Smart Data », course « Foundations of Smart Sensing », 9h of lectures

Ensai Rennes, international Master « Smart Data », course « Advanced topics in Smart Sensing », 3h of lectures

Ensai Rennes, Master 1, course « Penalized regression and model selection », 6h of lectures + 6 lab sessions + module coordinator

François Le Gland has given:

a 2nd year course on introduction to stochastic differential equations, at INSA (institut national des sciences appliquées) Rennes, within the GM/AROM (risk analysis, optimization and modeling) major in mathematical engineering,

a 3rd year course on Bayesian filtering and particle approximation, at ENSTA (école nationale supérieure de techniques avancées), Palaiseau, within the statistics and control module,

a 3rd year course on linear and nonlinear filtering, at ENSAI (école nationale de la statistique et de l'analyse de l'information), Ker Lann, within the statistical engineering track,

and a course on Kalman filtering and hidden Markov models, at université de Rennes 1, within the SISEA (signal, image, systèmes embarqués, automatique, école doctorale MATISSE) track of the master in electronic engineering and telecommunications.

Cédric Herzet has supervised:

Soufiane Ait Tilat, PhD, co-supervision with Frédéric Champagnat (Onera, Palaiseau)

Milan Courcoux-Caro, PhD, co-supervision with Charles Vanwynsberghe (ENSTA Bretagne) and Alexandre Baussard (IUT de Troyes),

Clément Elvira, postdoc, co-supervision with Rémi Gribonval (Inria Rennes) and Charles Soussen (CentraleSupélec)

Clément Dorffer, postdoc, co-supervision with Angélique Drémeau (ENSTA Rennes)

Mathias Rousset has supervised:

Benjamin Dufée, master 2 (with Frédéric Cérou).

François Le Gland has supervised:

Audrey Cuillery, PhD,
provisional title: *Bayesian tracking from raw data*,
université de Rennes 1,
started in April 2016,
expected defense in early 2020,
funding: CIFRE grant with Naval Group,
co–direction: Dann Laneuville (Naval Group, Nantes).

V. Monbet has supervised:

Gabriel Jouan, PhD, Univ Rennes, Scalian, granted by CIFRE.

Esso-Ridah Bleza, PhD, Univ Bretagne Sud, Janasense, granted by CIFRE.

Said Obakrim, PhD, Univ Rennes, Ifremer.

Y. Xu (sup.: M. Rousset and P.A. Zitt): Weak over-damped asymptotic and variance reduction

T.T.T. Chau (sup: V. Monbet): Non-parametric methodologies for reconstruction and estimation in nonlinear state-space models

M. Morvan (sup: V. Monbet): Regression models for heterogeneous functional data, with application to the modeling of mid-infrared spectrometry data

V. Monbet has been a member of the following juries:

Anders Hildeman - On flexible random field models for spatial statistics: Spatial mixture models and deformed SPDE models, Chalmers University, Sweden.

Shuaitao Wang - Simulation of the metabolism of the Seine river by continuous data assimilation, Mines ParisTech (reviewer)

Alban Farchi - Localization of ensemble data assimilation methods, ENPC

François Le Gland has been a member of the following juries:

(Reviewer) Julien Lesouple (université de Toulouse, adviser: Jean-Yves Tourneret)

Émilien Flayac (université Paris-Saclay, Orsay, advisers: Frédéric Jean and Karim Dahia)

Thi Tuyet Trang Chau (université de Rennes 1, advisers: Valérie Monbet and Pierre Ailliot).

Patrick Héas has run a CS workshop for middle school students based on the ideas of 'Computer Science Unplugged'.