Keywords
 A3.1.1. Modeling, representation
 A3.1.8. Big data (production, storage, transfer)
 A3.3. Data and knowledge analysis
 A3.3.3. Big data analysis
 A3.4. Machine learning and statistics
 A3.4.1. Supervised learning
 A3.4.2. Unsupervised learning
 A3.4.3. Reinforcement learning
 A3.4.4. Optimization and learning
 A3.4.5. Bayesian methods
 A3.4.7. Kernel methods
 A3.5.1. Analysis of large graphs
 A5.9.2. Estimation, modeling
 A6. Modeling, simulation and control
 A6.1. Methods in mathematical modeling
 A6.2. Scientific computing, Numerical Analysis & Optimization
 A6.2.4. Statistical methods
 A6.3. Computation-data interaction
 A6.3.1. Inverse problems
 A6.3.3. Data processing
 A6.3.4. Model reduction
 A9.2. Machine learning
 B1.1.4. Genetics and genomics
 B1.1.7. Bioinformatics
 B2.2.4. Infectious diseases, Virology
 B2.3. Epidemiology
 B2.4.1. Pharmaco kinetics and dynamics
 B3.4. Risks
 B4. Energy
 B4.4. Energy delivery
 B4.5. Energy consumption
 B5.2.1. Road vehicles
 B5.2.2. Railway
 B5.2.3. Aviation
 B5.5. Materials
 B5.9. Industrial maintenance
 B7.1. Traffic management
 B7.1.1. Pedestrian traffic and crowds
 B9.5.2. Mathematics
 B9.8. Reproducibility
 B9.9. Ethics
1 Team members, visitors, external collaborators
Research Scientists
 Kevin Bleakley [Inria, Researcher]
 Gilles Celeux [Inria, Emeritus]
 Gilles Stoltz [CNRS, Researcher, HDR]
Faculty Members
 Sylvain Arlot [Team leader, Univ Paris-Saclay, Professor, HDR]
 Christophe Giraud [Univ Paris-Saclay, Professor, HDR]
 Alexandre Janon [Univ Paris-Saclay, Associate Professor]
 Christine Keribin [Univ Paris-Saclay, Associate Professor, HDR]
 Pascal Massart [Univ Paris-Saclay, Professor, HDR]
 Patrick Pamphile [Univ Paris-Saclay, Associate Professor]
 Marie-Anne Poursat [Univ Paris-Saclay, Associate Professor]
Post-Doctoral Fellow
 Evgenii Chzhen [Univ Paris-Saclay]
PhD Students
 Yvenn Amara-Ouali [Univ Paris-Saclay]
 Emilien Baroux [Groupe PSA, from Jul 2020]
 Margaux Bregere [EDF, until Oct 2020]
 Geoffrey Chinot [ENSAE, until Aug 2020]
 Olivier Coudray [Groupe PSA]
 Remi Coulaud [SNCF, CIFRE]
 Solenne Gaucher [ENSAE]
 Hedi Hadiji [Ministère de l'Enseignement Supérieur et de la Recherche]
 Karl Hajjar [Univ Paris-Saclay, from Oct 2020]
 Malo Huard [Univ Paris-Saclay]
 Yann Issartel [Univ Paris-Saclay]
 Perrine Lacroix [Univ Paris-Saclay]
 Guillaume Maillard [Univ Paris-Saclay]
 Timothee Mathieu [École Normale Supérieure de Cachan]
 El Mehdi Saad [Univ Paris-Saclay]
Technical Staff
 Benjamin Auder [CNRS, Engineer]
Interns and Apprentices
 Cecile Poulain [Inria, from Mar 2020 until Aug 2020]
Administrative Assistant
 Laurence Fontana [Inria, from Oct 2020]
External Collaborators
 Claire Lacour [Univ Paris-Est Marne La Vallée]
 Matthieu Lerasle [CNRS, HDR]
2 Overall objectives
2.1 Mathematical statistics and learning
Data science – a vast field that includes statistics, machine learning, signal processing, data visualization, and databases – has become front-page news due to its ever-increasing impact on society, over and above the important role it already played in science over the last few decades. Within data science, the statistical community has long-term experience in how to infer knowledge from data, based on solid mathematical foundations. The more recent field of machine learning has also made important progress by combining statistics and optimization, with a fresh point of view that originates in applications where prediction is more important than building models.
The Celeste project-team is positioned at the interface between statistics and machine learning. We are statisticians in a mathematics department, with strong mathematical backgrounds behind us, interested in interactions between theory, algorithms and applications. Indeed, applications are the source of many of our interesting theoretical problems, while the theory we develop plays a key role in (i) understanding how and why successful statistical learning algorithms work – hence improving them – and (ii) building new algorithms upon mathematical statistics-based foundations.
In the theoretical and methodological domains, Celeste aims to analyze statistical learning algorithms – especially those which are most used in practice – with our mathematical statistics point of view, and develop new learning algorithms based upon our mathematical statistics skills.
A key ingredient in our research program is connecting our theoretical and methodological results with (a great number of) real-world applications. Indeed, Celeste members work in many domains, including (but not limited to) Covid-19, neglected tropical diseases, pharmacovigilance, high-dimensional transcriptomic analysis, and energy and the environment.
3 Research program
3.1 General presentation
Our objectives correspond to four major challenges of machine learning where mathematical statistics has a key role. First, any machine learning procedure depends on hyperparameters that must be chosen, and many procedures are available for any given learning problem: both are estimator selection problems. Second, with high-dimensional and/or large data, the computational complexity of algorithms must be taken into account differently, leading to possible trade-offs between statistical accuracy and complexity, for machine learning procedures themselves as well as for estimator selection procedures. Third, real data are almost always corrupted partially, making it necessary to provide learning (and estimator selection) procedures that are robust to outliers and heavy tails, while being able to handle large datasets. Fourth, science currently faces a reproducibility crisis, making it necessary to provide statistical inference tools ($p$-values, confidence regions) for assessing the significance of the output of any learning algorithm (including the tuning of its hyperparameters), in a computationally efficient way.
3.2 Estimator selection
An important goal of Celeste is to build and study procedures that can deal with general estimators (especially those actually used in practice, which often rely on some optimization algorithm), such as cross-validation and Lepski's method. In order to be practical, estimator selection procedures must be fully data-driven (that is, not relying on any unknown quantity), computationally tractable (especially in the high-dimensional setting, for which specific procedures must be developed) and robust to outliers (since most real data sets include a few outliers). Celeste aims at providing a precise theoretical analysis (for new and existing popular estimator selection procedures) that explains as well as possible their observed behaviour in practice.
3.3 Relating statistical accuracy to computational complexity
When several learning algorithms are available, with increasing computational complexity and statistical performance, which one should be used, given the amount of data and the computational power available? This problem has emerged as a key question induced by the challenge of analyzing large amounts of data – the “big data” challenge. Celeste wants to tackle the major challenge of understanding the time-accuracy trade-off, which requires providing new statistical analyses of machine learning procedures – as they are done in practice, including optimization algorithms – that are precise enough to account for differences of performance observed in practice, leading to conclusions that can be trusted more generally. For instance, we study the performance of ensemble methods combined with subsampling, which is a common strategy for handling big data; examples include random forests and median-of-means algorithms.
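As an illustration of the subsampling-and-aggregation idea mentioned above, here is a minimal sketch of a median-of-means estimator of a mean. It is a textbook version written for this report, not the team's implementation; the function name `median_of_means`, the number of blocks and the synthetic data are assumptions made for the example.

```python
import random
import statistics


def median_of_means(values, n_blocks, seed=0):
    """Median-of-means estimate of the mean of `values`.

    The sample is shuffled, split into `n_blocks` blocks of (nearly) equal
    size, and the median of the block means is returned.
    """
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    # Block b receives the items at positions b, b + n_blocks, b + 2 * n_blocks, ...
    blocks = [shuffled[b::n_blocks] for b in range(n_blocks)]
    return statistics.median(statistics.fmean(block) for block in blocks)


# Illustrative data (assumption): 99 standard Gaussian draws and one gross outlier.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(99)] + [1e6]
print("empirical mean :", statistics.fmean(sample))
print("median of means:", median_of_means(sample, n_blocks=10))
```

The single outlier contaminates only one block, so the median of the block means stays close to zero while the empirical mean does not. Each block mean is also cheap to compute, which is why such estimators are relevant to the accuracy-complexity question.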
3.4 Robustness to outliers and heavy tails (with tractable algorithms)
The classical theory of robustness in statistics has recently received a lot of attention in the machine learning community. The reason is simple: large datasets are easily corrupted, due to – for instance – storage and transmission issues, and most learning algorithms are highly sensitive to dataset corruption. For example, the lasso can be completely misled by the presence of even a single outlier in a dataset. A major challenge in robust learning is to provide computationally tractable estimators with optimal sub-Gaussian guarantees. A second important challenge in robust learning is to deal with datasets where every $(x_i, y_i)$ is slightly corrupted. In large-dimensional data, every single data point $x_i$ is likely to have several corrupted coordinates, and no estimator currently has strong theoretical guarantees for such data. A third important challenge is that of robust estimator selection or aggregation. Even if several robust estimators can be built, the final aggregation or selection step in a user's routine is usually based on empirical means. This is not robust, and may damage the global performance of the procedure. Instead, we can consider more sophisticated types of aggregation of the base robust estimators built so far. A convenient framework to do so is called adversarial learning (also known as prediction of individual sequences). Here, data is not assumed to be stochastic, and it could even be chosen by an adversary.
3.5 Statistical inference: (multiple) tests and confidence regions (including post-selection)
Celeste considers the problems of quantifying the uncertainty of predictions or estimations (thanks to confidence intervals) and of providing significance levels ($p$-values, corrected for multiplicity if needed) for each “discovery” made by a learning algorithm. This is an important practical issue when performing feature selection – one then speaks of post-selection inference – change-point detection or outlier detection, to name but a few. We tackle it in particular through a collaboration with the Parietal team (Inria Saclay) and LBBE (CNRS), with applications in neuroimaging and genomics.
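To make "corrected for multiplicity" concrete, the sketch below implements the classical Benjamini-Hochberg step-up procedure, one standard way to control the false discovery rate over a set of $p$-values. It is chosen here purely as an illustration and is not claimed to be the procedure used by the team; the function name and the example $p$-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the BH procedure."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) such that p_(k) <= k * alpha / m.
    n_rejected = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            n_rejected = rank
    return sorted(order[:n_rejected])


# Hypothetical p-values attached to four "discoveries".
p_values = [0.5, 0.01, 0.03, 0.02]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

Because the procedure is "step-up", a hypothesis can be rejected even if its own $p$-value exceeds its rank threshold, as long as a larger $p$-value passes its threshold.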
4 Application domains
4.1 Neglected tropical diseases
Celeste collaborates with Anavaj Sakuntabhai and Philippe Dussart (Pasteur Institute) on predicting dengue severity using only low-dimensional clinical data obtained at hospital arrival. Further collaborations are underway in dengue fever and encephalitis with researchers at the Pasteur Institute, including with Jean-David Pommier.
4.2 Covid-19
We collaborate with researchers at the Pasteur Institute and the University Hospital of Guadeloupe on the development of a rapid test for Covid-19 severity prediction as well as risk modeling and outcome prediction for patients admitted to intensive care units.
4.3 Electricity load consumption: forecasting and control
Celeste has a long-term collaboration with EDF R&D on electricity consumption. An important problem is to forecast consumption. We currently work on an approach involving back and forth disaggregation (of the total consumption into the consumptions of well-chosen groups/regions) and aggregation of local estimates. We also work on consumption control by price incentives sent to specific users (volunteers), seeing it as a bandit problem.
4.4 Reliability
Collected product lifetime data is often non-homogeneous, affected by production variability and differing real-world usage. Usually, this variability is not controlled or observed in any way, but needs to be taken into account for reliability analysis. Latent structure models are flexible models commonly used to model unobservable causes of variability.
Celeste currently collaborates with PSA Group. To dimension its vehicles, the PSA Group uses a reliability design method called Strength-Stress, which takes into consideration both the statistical distribution of part strength and the statistical distribution of customer load (called Stress). In order to minimize the risk of in-service failure, the probability that a “severe” customer will encounter a weak part must be quantified. Severity quantification is not simple since vehicle use and driver behaviour can be “severe” for some types of materials and not for others. The aim of the study is thus to define a new and richer notion of “severity” from PSA databases, resulting either from tests or from customer usage. This will lead to more robust and accurate part dimensioning methods. Two CIFRE theses are in progress on such subjects:
Olivier COUDRAY, “Fatigue Data-based Design: Probabilistic Modeling of Fatigue Behavior and Analysis of Fatigue Data to Assist in the Numerical Design of a Mechanical Part”. Here, we are seeking to build probabilistic fatigue criteria to identify the critical zones of a mechanical part.
Emilien BAROUX, “Reliability dimensioning under complex loads: from specification to validation”. Here, we seek to identify and model the critical loads that a vehicle can undergo according to its usage profile (driver, roads, climate, etc.).
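The quantity at the heart of the Strength-Stress approach, the probability that the customer load exceeds the part strength, can be sketched in a few lines under a strong simplifying assumption: strength and stress are independent and Gaussian. This assumption, the function name `failure_probability` and the numerical values are made for this illustration only and do not come from PSA data or from the theses above.

```python
from statistics import NormalDist


def failure_probability(strength, stress):
    """P(stress > strength) for independent Gaussian strength and stress."""
    # The margin (strength - stress) is Gaussian; failure means margin < 0.
    margin_mean = strength.mean - stress.mean
    margin_std = (strength.variance + stress.variance) ** 0.5
    return NormalDist(margin_mean, margin_std).cdf(0.0)


# Hypothetical values in arbitrary units (not PSA data).
part_strength = NormalDist(mu=100.0, sigma=8.0)
customer_stress = NormalDist(mu=70.0, sigma=6.0)
print(failure_probability(part_strength, customer_stress))
```

In this toy setting the failure probability depends only on the mean margin relative to the combined spread. The richer notion of "severity" sought in the collaboration replaces this single scalar stress by something that depends on the material and the usage profile.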
4.5 Spectroscopic imaging analysis of ancient materials
Ancient materials, encountered in archaeology and paleontology, are often complex, heterogeneous and poorly characterized before physicochemical analysis. A popular technique for gathering as much physicochemical information as possible is spectromicroscopy or spectral imaging, where a full spectrum, made of more than a thousand samples, is measured for each pixel. The produced data is tensorial with two or three spatial dimensions and one or more spectral dimensions, and requires the combination of an "image" approach with a "curve analysis" approach. Since 2010, Celeste (previously Select) has collaborated with Serge Cohen (IPANEMA) on clustering problems, taking spatial constraints into account.
4.6 Forecast of dwell time during train parking at stations
This is a Cifre PhD in collaboration with the SNCF.
One of the factors in the punctuality of trains in dense areas (and in crisis management in the event of an incident on a line) is the respect of both the travel time between two stations and the parking time in a station. These depend, among other things, on the train, its mission, the schedule, the instantaneous load, and the configuration of the platform or station. Preliminary internal studies at the SNCF have shown that the problem is complex. From a dataset concerning line E of the Transilien in Paris, we aim to address prediction (machine learning) and modeling (statistics): (1) construct a model of station-hours, station-hours-type of train, for example using co-clustering techniques; (2) study the correlations between the number of passengers (load), boarding and alighting flows, and parking times, and possibly other variables to be defined; (3) model the flows or loads (within the same station, or the same train) as a stochastic process; (4) develop a realistic digital simulator of passenger flows and test different scenarios of incidents and resolution, in order to propose effective solutions.
4.7 Algorithmic fairness
Machine learning algorithms make pivotal decisions, which influence our lives on a daily basis, using data about individuals. Recent studies show that imprudent use of these algorithms leads to unfair and discriminating decisions, often inheriting or even amplifying disparities present in data. The goal of this research program is to design and analyze novel tractable algorithms that, while still optimizing prediction performance, mitigate or remove unfair decisions of the learned predictor. A major challenge in the machine learning fairness literature is to obtain algorithms which satisfy fairness and risk guarantees simultaneously. Several empirical studies suggest that there is a trade-off between fairness and accuracy of a learned model – more accurate models are less fair. A theoretical study of these types of trade-offs is among the main directions of this research project. The goal is to provide user-friendly statistical quantification of these trade-offs and build statistically optimal algorithms in this context.
5 Social and environmental responsibility
5.1 Footprint of research activities
Influenced in particular by the Covid-19 pandemic in 2020, the carbon emissions of Celeste team members related to their jobs were very low and came essentially from:
 limited levels of transport to and from work, and not from travel to conferences.
 electronic communication (email, Google searches, Zoom meetings, online seminars, etc.).
 the carbon emissions embedded in their personal computing devices (construction), either laptops or desktops.
 electricity for personal computing devices and for the workplace, plus water, heating, and maintenance for the latter. Note that only 7.1% (2018) of France's electricity is not sourced from nuclear energy or renewables, so team member carbon emissions related to electricity are minimal.
In terms of magnitude, the largest per capita ongoing emissions (excluding flying) are likely to be those embedded in the construction of newly bought computers, in the range of 100 kg CO2e each. In contrast, typical email use per year is around 10 kg CO2e per person, and a Zoom call comes to around 10 g CO2e per hour per person, while web browsing uses around 100 g CO2e per hour. Consequently, 2020 was a very low carbon year for the Celeste team. To put this in the context of work travel by flying, one return Paris-Nice flight corresponds to 160 kg CO2e emissions, which likely dwarfs the total work-related emissions of any one Celeste team member in 2020.
The approximate (rounded for simplicity) CO2e values cited above come from the book “How Bad are Bananas” by Mike Berners-Lee (2020), which estimates carbon emissions in everyday life.
5.2 Impact of research results
In addition to the long-term impact of our theoretical works (which is of course impossible to assess immediately), we are involved in several applied research projects which aim at having a short/mid-term positive impact on society.
First, we collaborate with the Pasteur Institute and the University Hospital of Guadeloupe on medical issues related to some neglected tropical diseases and to Covid-19.
Second, the broad use of artificial intelligence/machine learning/statistics nowadays comes with several major ethical issues, one being to avoid making unfair or discriminatory decisions. Our theoretical work on algorithmic fairness has already led to several “fair” algorithms that could be widely used in the short term (one of them is already used for enforcing fair decision-making in student admissions at the University of Genoa).
Third, we expect short-term positive impact on society thanks to several direct collaborations with companies such as EDF (forecasting and control of electricity load consumption), SNCF (punctuality of trains in densely-populated regions, 1 Cifre contract ongoing) and the PSA group (reliability, with 2 Cifre contracts ongoing).
6 Highlights of the year
6.1 Awards
S. Arlot has been a junior member of the Institut Universitaire de France (IUF) since September 2020.
The paper [2], first-authored by E. Chzhen, was selected for an oral presentation at NeurIPS 2020 (1.1% of submitted works accepted).
7 New software and platforms
7.1 New software
7.1.1 BlockCluster
 Name: Block Clustering
 Keywords: Statistic analysis, Clustering package
 Scientific Description: Simultaneous clustering of rows and columns, usually designated by biclustering, co-clustering or block clustering, is an important technique in two-way data analysis. It consists of estimating a mixture model which takes into account the block clustering problem on both the individual and variable sets. The blockcluster package provides a bridge between the C++ core library and the R statistical computing environment. This package allows co-clustering of binary, contingency, continuous and categorical datasets. It also provides utility functions to visualize the results. This package may be useful for various applications in fields such as data mining, information retrieval, biology, computer vision and many more.
 Functional Description: BlockCluster is an R package for co-clustering of binary, contingency and continuous data based on mixture models.
 Release Contributions: Initialization strategy enhanced

 URL: http://cran.r-project.org/web/packages/blockcluster/index.html
 Authors: Parmeet Bhatia, Serge Iovleff, Vincent Brault
 Contacts: Christophe Biernacki, Gilles Celeux, Serge Iovleff
 Participants: Christophe Biernacki, Gilles Celeux, Parmeet Bhatia, Serge Iovleff, Vincent Brault, Vincent Kubicki
 Partner: Université de Technologie de Compiègne
7.1.2 MASSICCC
 Name: Massive Clustering with Cloud Computing
 Keywords: Statistic analysis, Big data, Machine learning, Web Application
 Scientific Description: The web application lets users run several software packages developed by Inria directly in a web browser. Mixmod is a classification library for continuous and categorical data. MixtComp allows for missing data and a larger choice of data types. BlockCluster is a library for co-clustering of data. When using the web application, the user can first upload a data set, then configure a job using one of the libraries mentioned and start the execution of the job on a cluster. The results are then displayed directly in the browser, allowing for rapid understanding and interactive visualisation.
 Functional Description: The MASSICCC web application offers a simple and dynamic interface for analysing heterogeneous data with a web browser. Various software packages for statistical analysis are available (Mixmod, MixtComp, BlockCluster) which allow for supervised and unsupervised classification of large data sets.

 URL: https://massiccc.lille.inria.fr
 Contact: Christophe Biernacki
7.1.3 Mixmod
 Name: Many-purpose software for data mining and statistical learning
 Keywords: Data mining, Classification, Mixed data, Data modeling, Big data

 Functional Description: Mixmod is a free toolbox for data mining and statistical learning designed for large and high-dimensional data sets. Mixmod provides reliable estimation algorithms and relevant model selection criteria. It has been successfully applied to marketing, credit scoring, epidemiology, genomics and reliability, among other domains. Its particularity is to propose a model-based approach leading to many methods for classification and clustering. Mixmod makes it possible to assess the stability of the results with simple and thorough scores. It provides an easy-to-use graphical user interface (mixmodGUI) and functions for the R (Rmixmod) and Matlab (mixmodForMatlab) environments.

 URL: http://www.mixmod.org
 Authors: Christophe Biernacki, Florent Langrognet, Gérard Govaert, Gilles Celeux
 Contacts: Christophe Biernacki, Gilles Celeux
 Participants: Benjamin Auder, Christophe Biernacki, Florent Langrognet, Gérard Govaert, Gilles Celeux, Remi Lebret, Serge Iovleff
 Partners: CNRS, Université Lille 1, LIFL, Laboratoire Paul Painlevé, HEUDIASYC, LMB
8 New results
8.1 Aggregated Hold-Out
Aggregated hold-out (Agghoo) is a method which averages learning rules selected by hold-out (that is, cross-validation with a single split).
G. Maillard, S. Arlot and M. Lerasle provided in [11] the first theoretical guarantees on Agghoo, ensuring that it can be used safely: Agghoo performs at worst like hold-out when the risk is convex. The same holds true in classification with the 0-1 risk, with an additional constant factor. For hold-out, oracle inequalities are known for bounded losses, as in binary classification. They show that similar results can be proved, under appropriate assumptions, for other risk-minimization problems. In particular, an oracle inequality holds true for regularized kernel regression with a Lipschitz loss, without requiring that the $Y$ variable or the regressors be bounded. Numerical experiments show that aggregation brings a significant improvement over hold-out and that Agghoo is competitive with cross-validation.
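The description of Agghoo above can be turned into a short sketch: for several random train/validation splits, select one learning rule by hold-out, then average the selected predictors. The toy setting below (one-dimensional ridge regression without intercept, a small grid of regularization parameters, ten splits, synthetic data) is an assumption made for illustration and is not the experimental setup of [11].

```python
import random


def ridge_slope(data, lam):
    """Slope of 1-D ridge regression without intercept."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


def validation_risk(slope, data):
    """Mean squared error of the predictor x -> slope * x on `data`."""
    return sum((y - slope * x) ** 2 for x, y in data) / len(data)


def agghoo_slope(data, lambdas, n_splits=10, train_fraction=0.8, seed=0):
    """Average, over random splits, of the slope selected by hold-out."""
    rng = random.Random(seed)
    n_train = int(train_fraction * len(data))
    selected = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, valid = shuffled[:n_train], shuffled[n_train:]
        # Hold-out: fit every candidate on `train`, keep the best on `valid`.
        slopes = [ridge_slope(train, lam) for lam in lambdas]
        selected.append(min(slopes, key=lambda s: validation_risk(s, valid)))
    # Aggregation: averaging linear predictors amounts to averaging slopes.
    return sum(selected) / len(selected)


# Synthetic data with true slope 2 (illustration only).
rng = random.Random(42)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]
slope = agghoo_slope(data, lambdas=[0.0, 0.1, 1.0, 10.0, 100.0])
print(slope)
```

Note that each split may select a different regularization parameter; Agghoo averages the resulting predictors rather than choosing a single hyperparameter, which is what distinguishes it from cross-validation.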
In another paper [33], G. Maillard studied aggregated hold-out for sparse linear regression with a robust loss function. Sparse linear regression methods generally have a free hyperparameter which controls the amount of sparsity, and is subject to a bias-variance trade-off. This article considers the use of aggregated hold-out to aggregate over values of this hyperparameter, in the context of linear regression with the Huber loss function. Aggregated hold-out (Agghoo) is a procedure which averages estimators selected by hold-out (cross-validation with a single split). In the theoretical part of the article, it is proved that Agghoo satisfies a non-asymptotic oracle inequality when it is applied to sparse estimators which are parametrized by their zero-norm. In particular, this includes a variant of the Lasso introduced by Zou, Hastie and Tibshirani. Simulations are used to compare Agghoo with cross-validation. They show that Agghoo performs better than cross-validation when the intrinsic dimension is high and when there are confounders correlated with the predictive covariates.
In his Ph.D. thesis [4], G. Maillard obtained more precise results in a specific setting, showing that Agghoo then strictly improves the performance of any model selection procedure. This is a remarkable result, which is to the best of our knowledge the first of its kind. Its proof required several advanced mathematical results.
8.2 Online Orthogonal Matching Pursuit
Greedy algorithms for feature selection are widely used for recovering sparse high-dimensional vectors in linear models. In classical procedures, the main emphasis is put on the sample complexity, with little or no consideration of the computational resources required. E.M. Saad and S. Arlot, in collaboration with G. Blanchard, proposed in [34] a novel online algorithm, called Online Orthogonal Matching Pursuit (OOMP), for online support recovery in the random design setting of sparse linear regression. The procedure selects features sequentially, alternating between allocation of samples only as needed to candidate features, and optimization over the selected set of variables to estimate the regression coefficients. Theoretical guarantees about the output of this algorithm are proven and its computational complexity is analysed.
8.3 Aggregation of Multiple Knockoffs
T.B. Nguyen and S. Arlot, in collaboration with J.-A. Chevalier and B. Thirion, developed an extension of the knockoff inference procedure introduced by Barber and Candès [2015]. This new method, called aggregation of multiple knockoffs (AKO), addresses the instability inherent to the random nature of knockoff-based inference. Specifically, AKO improves both the stability and power compared with the original knockoff algorithm while still maintaining guarantees for false discovery rate control. In [13], they provided a new inference procedure, proved its core properties, and demonstrated its benefits in a set of experiments on synthetic and real datasets.
8.4 New results for stochastic bandits
G. Stoltz and H. Hadiji (see [30]) studied adaptation to the range for stochastic bandit problems with finitely many arms, each associated with a distribution supported on a given finite range $[m,M]$. They do not assume that the range $[m,M]$ is known, and show that there is a cost for learning this range. Indeed, a new trade-off between distribution-dependent and distribution-free regret bounds arises, which prevents one from simultaneously achieving the typical $\ln T$ and $\sqrt{T}$ bounds. For instance, a $\sqrt{T}$ distribution-free regret bound may only be achieved if the distribution-dependent regret bounds are at least of order $\sqrt{T}$. They exhibit a strategy achieving the regret rates indicated by the new trade-off.
8.5 Finite continuum-armed bandits
The finite continuum-armed bandit problem arises in many applications where an agent must allocate a finite budget $T$ among a larger number $N$ of actions described by covariates, and each action can only be taken once. Focusing on a nonparametric setting, where the mean reward is an unknown function of a one-dimensional covariate, [28] proposes an optimal strategy for this problem. Under natural assumptions on the reward function, the optimal regret scales as $O(T^{1/3})$ up to polylogarithmic factors when the budget $T$ is proportional to the number of actions $N$. When $T$ becomes small compared to $N$, a smooth transition occurs. When the ratio $T/N$ decreases from a constant to $N^{-1/3}$, the regret increases progressively up to the $O(T^{1/2})$ rate encountered in classical continuum-armed bandits.
8.6 Robust risk minimization for machine learning
In collaboration with S. Minsker (USC), T. Mathieu worked on obtaining new excess risk bounds in robust empirical risk minimization. The method proposed in their paper [36] is inspired by the robust risk minimization procedure using median-of-means estimators in Lecué, Lerasle and Mathieu (2018). The excess risk rates obtained are faster than the so-called “slow rate of convergence” obtained for the minimization procedure in Lecué, Lerasle and Mathieu (2018), and a slightly modified procedure achieves a minimax rate of convergence under low moment assumptions. Experiments on synthetic corrupted data and a real dataset illustrate the accuracy of the method, showing high performance in classification and regression tasks in a corrupted setting.
8.7 Fairness: statistical guarantees and efficient methods
Until very recently, results on algorithmic fairness were almost exclusively focused on classification problems. Yet, in many application domains, continuous outputs are more valuable even if the underlying problem is that of classification (e.g., credit scoring). In collaboration with C. Denis, M. Hebiri (Univ. Gustave Eiffel), L. Oneto (Univ. Genoa) and M. Pontil (Istituto Italiano di Tecnologia, Univ. College London), E. Chzhen proposed in [18] a post-processing regression method which enjoys finite-sample risk and fairness guarantees. Their approach is based on a carefully chosen discretization of the signal space, essentially reducing the problem of regression to a problem of multi-class classification. Later, in [19], a connection between the problem of finding the optimal fair regression (in the sense of Demographic Parity) and the Wasserstein barycenter problem is derived. This connection allows us to build a data-driven post-processing method, which avoids the discretization step using the theory of optimal transport. This algorithm enjoys distribution-free fairness guarantees. Under additional assumptions, risk guarantees are also derived. A statistical minimax framework is proposed by E. Chzhen and N. Schreuder (CREST, ENSAE) in [27]. This framework is built upon the earlier established connection between fair regression and optimal transport theory, and allows us to study partially fair predictions. Within the proposed setup, Chzhen and Schreuder quantify the trade-off between Demographic Parity fairness and squared risk by obtaining a characterization of the Pareto frontier. Finally, they derive a general problem-dependent lower bound on the risk of any partially fair prediction and confirm its tightness on a Gaussian regression model with systematic group-dependent bias.
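The barycenter-based post-processing idea can be sketched as follows: a prediction is first converted to its rank within its own group, and that rank is then mapped to the weighted average of the corresponding quantiles of all groups, so that every group ends up with the same distribution of predictions (Demographic Parity). The code is a plain plug-in simplification written for this report, without the refinements on which the published guarantees rely; the function name `fair_scores` and the toy data are hypothetical.

```python
from bisect import bisect_right


def fair_scores(scores_by_group):
    """Map each group's scores to a common distribution (plug-in sketch).

    `scores_by_group` maps a group label to the list of predictions made by
    some base regressor for the members of that group.
    """
    sorted_scores = {g: sorted(v) for g, v in scores_by_group.items()}
    total = sum(len(v) for v in sorted_scores.values())
    weights = {g: len(v) / total for g, v in sorted_scores.items()}

    def transform(score, group):
        own = sorted_scores[group]
        # Rank of the score within its own group (empirical CDF times n).
        rank = max(1, bisect_right(own, score))
        fair = 0.0
        for other, values in sorted_scores.items():
            # Empirical quantile of `other` at level rank / len(own),
            # computed with integer arithmetic to avoid rounding issues.
            idx = -(-rank * len(values) // len(own)) - 1
            fair += weights[other] * values[idx]
        return fair

    return {g: [transform(s, g) for s in v] for g, v in scores_by_group.items()}


# Hypothetical predictions with a systematic shift between two groups.
scores = {"a": [1.0, 2.0, 3.0, 4.0], "b": [11.0, 12.0, 13.0, 14.0]}
print(fair_scores(scores))
```

On this toy input, the group-dependent shift is removed while the ordering of individuals within each group is preserved.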
8.8 Should the clustering of graphs be bipartite?
When clustering the nodes of a graph, a unique partition of the nodes is usually built, whether the graph is undirected or directed. While this choice is pertinent for undirected graphs, it is debatable for directed graphs because it implies that no difference is made between the clusters of source and target nodes. Defining two different clusterings for source and target nodes leads to considering a kind of bipartite clustering. We examine this question in the context of probabilistic models with latent variables, and compare the use of the stochastic block model (SBM) and the latent block model (LBM). We analyze and discuss this comparison through simulated and real data sets and provide recommendations [32].
8.9 Statistical analyses of standardized micropatterned cells
Live imaging of lysosomal secretion monitored by total internal reflection fluorescence imaging of VAMP7-pHluorin is a straightforward way to explore secretion from this compartment. Taking advantage of cell culture on micropatterned surfaces to normalize cell shape, we employed a variety of statistical tools to perform a spatial analysis of secretory patterns. Using Ripley’s K function and a statistical test based on nearest neighbor distance (NND), we confirmed that secretion from lysosomes is not a random process but shows significant clustering [9].
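For readers unfamiliar with Ripley's K function, the following sketch computes its simplest estimate at a given radius: the observation area times the proportion of ordered pairs of events closer than that radius. No edge correction is applied, which a real analysis such as the one in [9] would typically require; the function name, the point coordinates and the area are assumptions made for the example.

```python
import math


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate at one radius, without edge correction."""
    n = len(points)
    # Count ordered pairs (i, j), i != j, lying within `radius` of each other.
    close_pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= radius
    )
    return area * close_pairs / (n * (n - 1))


# Hypothetical event locations: the four corners of a unit square.
events = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
print(ripley_k(events, radius=1.0, area=1.0))  # estimate from the data
print(math.pi * 1.0 ** 2)  # theoretical K(r) under complete spatial randomness
```

Under complete spatial randomness, $K(r) = \pi r^2$; estimated values clearly above this reference suggest clustering at scale $r$, and values below it suggest regularity.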
8.10 Comparison of dengue case classification schemes and evaluation of biological changes in different dengue clinical patterns
The World Health Organization (WHO) proposed guidelines on dengue clinical classification in 1997 and more recently in 2009 for the clinical management of patients. The WHO 1997 classification defines three categories of dengue infection according to severity: dengue fever (DF), dengue hemorrhagic fever (DHF), and dengue shock syndrome (DSS). The alternative WHO 2009 guidelines provide a cross-sectional classification aiming to discriminate dengue fever from dengue with warning signs (DWWS) and severe dengue (SD). In this study we compared the two dengue classifications from both a biological and a statistical point of view 7.
8.11 Consistency and asymptotic normality of Latent Block Model estimators
The Latent Block Model (LBM) is a model-based method to cluster simultaneously the $d$ columns and $n$ rows of a data matrix. Parameter estimation in LBM is a difficult and multifaceted problem. Although various estimation strategies have been proposed and are now well understood empirically, theoretical guarantees about their asymptotic behavior are rather sparse and most results are limited to the binary setting. We have proved in 6 theoretical guarantees in the valued setting. We show that under some mild conditions on the parameter space, and in an asymptotic regime where $\log(d)/n$ and $\log(n)/d$ tend to 0 when $n$ and $d$ tend to infinity, (1) the maximum-likelihood estimate of the complete model (with known labels) is consistent and (2) the log-likelihood ratios are equivalent under the complete and observed (with unknown labels) models. This equivalence allows us to transfer the asymptotic consistency, and under mild conditions, asymptotic normality, to the maximum likelihood estimate under the observed model. Moreover, the variational estimator is also consistent and, under the same conditions, asymptotically normal.
8.12 A quantitative McDiarmid's inequality for geometrically ergodic Markov chains
We state and prove in 8 a quantitative version of the bounded difference inequality for geometrically ergodic Markov chains. Our proof uses the same martingale decomposition as an earlier result but, compared with that earlier work, the exact coupling argument is modified to fill a gap between the strongly aperiodic case and the general aperiodic case.
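For reference, the classical bounded difference inequality in the independent case, which the result above extends to Markov chains, can be stated as follows. The notation ($X_1,\dots,X_n$, $f$, $c_i$, $t$) is introduced here for illustration and is not taken from 8, whose constants additionally depend on the ergodicity properties of the chain. If $X_1,\dots,X_n$ are independent and changing the $i$-th argument of $f$ changes its value by at most $c_i$, then for all $t>0$,

$$\mathbb{P}\bigl(f(X_1,\dots,X_n) - \mathbb{E}[f(X_1,\dots,X_n)] \ge t\bigr) \le \exp\Bigl(-\frac{2t^{2}}{\sum_{i=1}^{n} c_i^{2}}\Bigr).$$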
8.13 Robust machine learning by median-of-means: theory and practice
We introduce in 10 new estimators for robust machine learning based on median-of-means (MOM) estimators of the mean of real-valued random variables. These estimators achieve optimal rates of convergence under minimal assumptions on the dataset. The dataset may also have been corrupted by outliers on which no assumption is made. We also analyze these new estimators with standard tools from robust statistics. In particular, we revisit the concept of breakdown point. We modify the original definition by studying the number of outliers that a dataset can contain without deteriorating the estimation properties of a given estimator. This new notion of breakdown number, which takes into account the statistical performance of the estimators, is non-asymptotic in nature and adapted to machine learning purposes. We proved that the breakdown number of our estimator is of the order of (number of observations) × (rate of convergence). For instance, the breakdown number of our estimators for the problem of estimating a $d$-dimensional vector with noise variance $\sigma^{2}$ is $\sigma^{2} d$, and it becomes $\sigma^{2} s \log(d/s)$ when this vector has only $s$ non-zero components. Beyond this breakdown point, we proved that the rate of convergence achieved by our estimator is (number of outliers) divided by (number of observations). Besides these theoretical guarantees, the major improvement brought by these new estimators is that they are easily computable in practice. In fact, essentially any algorithm used to approximate the standard Empirical Risk Minimizer (or its regularized versions) has a robust version approximating our estimators. As a proof of concept, we study many algorithms for the classical LASSO estimator. A byproduct of the MOM algorithms is a measure of depth of data that can be used to detect outliers.
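The basic building block can be sketched in a few lines: the median-of-means estimate of a mean splits the sample into blocks, averages each block, and returns the median of the block means. The function name `median_of_means`, the shuffling step and the `seed` argument are illustrative assumptions; the estimators studied in 10 apply the same principle to risk minimization and are more involved.

```python
import numpy as np

def median_of_means(x, n_blocks, seed=0):
    """Median of the block-wise means of x, using n_blocks random blocks."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(x)  # random assignment of points to blocks
    blocks = np.array_split(shuffled, n_blocks)
    return float(np.median([block.mean() for block in blocks]))
```

As long as outliers contaminate fewer than half of the blocks, the median is taken over a majority of clean block means, whereas the empirical mean can be moved arbitrarily far by a single outlier.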
8.14 A binned technique for scalable model-based clustering on huge datasets
Clustering is affected by the steady increase in sample sizes, which provides an opportunity to reveal information previously out of reach. However, the volume of data raises issues related to the need for large computational resources and to high energy consumption. Resorting to binned data on an adaptive grid is expected to give a proper answer to such green computing issues without harming the quality of the related estimation. After a brief review of existing methods, a first application in the context of univariate model-based clustering is provided in 12, with a numerical illustration of its advantages. Finally, an initial formalization of the multivariate extension is given, highlighting both issues and possible strategies.
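To make the idea of working with binned data concrete, the sketch below evaluates the log-likelihood of a univariate Gaussian mixture from bin counts alone: each bin contributes its count times the log of the mixture mass falling in that bin. The function names, the fixed equal-width grid and the simulated data are illustrative assumptions; the adaptive grid and the estimation algorithm of 12 are not reproduced here.

```python
import math
import numpy as np

def normal_cdf(x, mu, sigma):
    """Gaussian cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def binned_loglik(counts, edges, weights, means, sigmas):
    """Log-likelihood of a univariate Gaussian mixture given only bin counts."""
    total = 0.0
    for n_b, lo, hi in zip(counts, edges[:-1], edges[1:]):
        if n_b == 0:
            continue  # empty bins do not contribute
        # Mixture probability mass of the bin [lo, hi).
        p_bin = sum(w * (normal_cdf(hi, m, s) - normal_cdf(lo, m, s))
                    for w, m, s in zip(weights, means, sigmas))
        total += n_b * math.log(max(p_bin, 1e-300))  # guard against log(0)
    return total

# Usage: 100,000 raw points are summarized by 50 bin counts (fixed grid, for simplicity).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 50_000), rng.normal(3.0, 0.5, 50_000)])
counts, edges = np.histogram(x, bins=50)
ll_true = binned_loglik(counts, edges, [0.5, 0.5], [-2.0, 3.0], [1.0, 0.5])
ll_wrong = binned_loglik(counts, edges, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
```

The cost of one likelihood evaluation depends on the number of bins rather than on the sample size, which is where the computational savings come from.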
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
 G. Stoltz: New contract with BNP Paribas (10 kE), on stochastic bandits under budget constraints, for an application to loan management. New contract with EDF R&D on studying the impact of Covid-19 on electricity demand (with Solenne Gaucher as research engineer).
 C. Keribin and P. Pamphile: OpenLabIA Inria-Groupe PSA collaboration contract (85 kE).
 A. Constantinescu and P. Pamphile: Collaboration contract with Groupe PSA (95 kE).
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Inria associate team not involved in an IIL
C. Keribin collaborates with Christophe Biernacki (INRIA-Modal) on unsupervised learning of huge datasets with limited computer resources. A co-advised thesis (DGA grant) is ongoing.
10.2 National initiatives
10.2.1 ANR
Sylvain Arlot and Matthieu Lerasle are part of the ANR grant FAST-BIG (Efficient Statistical Testing for high-dimensional Models: application to Brain Imaging and Genetics), which is led by Bertrand Thirion (Inria Saclay, Parietal).
Sylvain Arlot and Christophe Giraud are part of the ANR Chair-IA grant Biscotte, which is led by Gilles Blanchard (Université Paris-Saclay).
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
Member of the organizing committees
C. Giraud: Co-organizer with Estelle Kuhn of the conference “StatMathAppli”, to take place in August 2021.
11.1.2 Scientific events: selection
Member of the conference program committees
 S. Arlot: area chair for AISTATS 2021
 C. Keribin: Scientific VP for Federated learning workshops (with SFdS and Owkin)
 C. Giraud: in charge of the "High-dimensional statistics" session of the Bernoulli-IMS symposium, August 2020
 C. Giraud: in charge of a special session "Statistical learning" in the AMS-SMF-EMS joint international meeting, July 2021
 C. Giraud: program committee of COLT, July 2021
Reviewer
We performed many reviews for various international conferences.
11.1.3 Journal
Member of the editorial boards
 S. Arlot: associate editor for Annales de l'Institut Henri Poincaré B – Probability and Statistics
 G. Stoltz: associate editor for Mathematics of Operations Research
Reviewer - reviewing activities
We performed many reviews for various international journals.
11.1.4 Invited talks
S. Arlot, Statistics seminar of LPSM, Paris, 01/12/2020.
C. Keribin, Statistics seminar, AgroParisTech, Paris, 18/05/2020
C. Keribin, Statistics seminar, INRAE-MaIAGE, Jouy-en-Josas, 14/12/2020
C. Keribin, ERCIM-CMStatistics, online, 20/12/2020
E. Chzhen, Le Séminaire Palaisien, online, 06/10/2020
E. Chzhen, Statistics seminar, AgroParisTech, Paris, 02/03/2020
E. Chzhen, Stat·Eco·ML Seminar at ENSAE Paris, Palaiseau, 05/02/2020
11.1.5 Leadership within the scientific community
C. Keribin is President of the MALIA (Machine Learning and AI) group of the French Statistical Society (SFdS).
11.1.6 Research administration
 S. Arlot coordinates the mathAI (mathematics for artificial intelligence) program of the Labex Mathématique Hadamard and is a member of the executive committee of Fondation Mathématique Jacques Hadamard (FMJH).
 S. Arlot is a member of the steering committee of the Paris-Saclay Center for Data Science.
 S. Arlot is a member of the (temporary) board of the Computer Science Graduate School of University Paris-Saclay.
 S. Arlot is a member of the (temporary) board of the Mathematics Graduate School of University Paris-Saclay.
 S. Arlot is a member of the board of the Computer Science Doctoral School (ED STIC) of University Paris-Saclay.
 C. Giraud has coordinated the mathSV (mathematics for life science) program of the Labex Mathématique Hadamard and is a member of the executive committee of Fondation Mathématique Jacques Hadamard (FMJH).
 C. Giraud is a member of the scientific committee of the Labex IRMIA (Strasbourg).
 C. Giraud is a local member of the scientific committee of the Pascal Institute (Saclay).
 C. Giraud is a member of the steering committee of the Mathematics Graduate School of University Paris-Saclay.
 C. Giraud is in charge of the whole master program in Mathematics of Paris-Saclay.
 C. Keribin is an elected member of the steering committee of Labex LMH and of the FMJH foundation.
 C. Keribin is an elected member of CAC Paris-Saclay.
 C. Keribin is a member of the jury awarding the Paris-Saclay Idex and the FMJH Sophie Germain scholarships.
 C. Keribin is in charge of Master 1 Applied Mathematics and Master 2 Data Science of Paris-Saclay.
 P. Massart is Director of the Fondation Mathématique Jacques Hadamard (FMJH).
11.2 Teaching - Supervision - Juries
11.2.1 Teaching
Most of the team members (especially Professors, Associate Professors and Ph.D. students) teach several courses at University Paris-Saclay, as part of their teaching duty. We mention below some of the classes in which we teach.
 Licence: S. Arlot, Probability and Statistics, 68h, L2, Université Paris-Sud
 Master: S. Arlot, Statistical learning and resampling, 30h, M2, Université Paris-Sud
 Master: S. Arlot, Probability and Statistics M2 seminar, 30h, M2, Université Paris-Sud
 Master: S. Arlot, Preparation for the French mathematics agrégation (statistics), 50h, M2, Université Paris-Sud
 Master: C. Giraud, High-Dimensional Probability and Statistics, 45h, M2, Université Paris-Saclay
 Master: C. Giraud, Mathematics for AI, 75h, M1, Université Paris-Saclay
 Master: C. Keribin, Unsupervised and supervised learning, 42h, M1, Université Paris-Saclay
11.2.2 Supervision
 PhD defended on December 4, 2020: Hédi Hadiji, Sur quelques questions d'adaptation dans des problèmes de bandits stochastiques, started September 2017, co-advised by G. Stoltz and P. Massart
 PhD defended on December 10, 2020: Margaux Brégère, Algorithmes de bandits stochastiques pour la gestion de la demande électrique, started October 2017, manuscript: 21, co-advised by G. Stoltz, P. Gaillard and Y. Goude
 PhD defended on November 24, 2020: Yann Issartel, Inférence sur des graphes aléatoires, started Sept. 2017, advised by C. Giraud
 PhD defended on September 23, 2020: Malo Huard, Apprentissage et prévision séquentiels : bornes uniformes pour le regret linéaire et séries temporelles hiérarchiques, started October 2016, advised by G. Stoltz
 PhD defended on September 20, 2020: Guillaume Maillard, Aggregated cross-validation, started Sept. 2016, co-advised by S. Arlot and M. Lerasle
 PhD in progress: El Mehdi Saad, Interactions between statistical and computational aspects in machine learning, started Sept. 2019, co-advised by S. Arlot and G. Blanchard
 PhD in progress: Tuan-Binh Nguyen, Efficient Statistical Testing for High-Dimensional Models, started Oct. 2018, co-advised by S. Arlot and B. Thirion
 PhD in progress: Rémi Coulaud, Forecast of dwell time during train parking at station, started Oct. 2019, co-advised by G. Stoltz and C. Keribin, Cifre with SNCF
 PhD in progress: Olivier Coudray, Fatigue data-based design, started Nov. 2019, co-advised by C. Keribin and P. Pamphile, Cifre with Groupe PSA
 PhD in progress: Filippo Antonazzo, Unsupervised learning of huge datasets with limited computer resources, started Nov. 2019, co-advised by C. Biernacki (INRIA-Modal) and C. Keribin, DGA grant
 PhD in progress: Solenne Gaucher, Sequential learning in random networks, started Sept. 2018, advised by C. Giraud
 PhD in progress: Karl Hajjar, Analyse dynamique de réseaux de neurones, started Oct. 2020, co-advised by C. Giraud and L. Chizat
 PhD in progress: Emilien Baroux, Reliability dimensioning under complex loads: from specification to validation, started July 2020, co-advised by A. Constantinescu and P. Pamphile, Cifre with Groupe PSA
11.2.3 Juries
 S. Arlot: referee for the HdR of Erwan Scornet, Université Paris-Saclay, 17/12/2020.
 S. Arlot: member of the HdR committee of Victor-Emmanuel Brunel, Institut Polytechnique de Paris, 15/09/2020.
 S. Arlot: member of the HdR committee of Guillem Rigaill, Université Paris-Saclay, 18/09/2020.
 S. Arlot: member of the PhD committee of Baptiste Barreau, Université Paris-Saclay, 15/09/2020.
 S. Arlot: member of the PhD committee of Gautier Appert, Institut Polytechnique de Paris, 29/10/2020.
 C. Giraud: many HDR and PhD juries as referee or member of the committee
 G. Stoltz: reviewer of the PhD manuscripts by Vincent Margot (Sorbonne Université, October 2020) and Julien Seznec (Inria Lille, December 2020), and of the HDR manuscript by Emilie Kaufmann (Inria Lille, November 2020)
 C. Keribin: reviewer of the PhD manuscript by Léa Longepierre (Sorbonne University, July 2020)
 C. Keribin: member of the PhD committee of Eva Lawrence (Université Paul Sabatier, December 2020)
11.3 Popularization
11.3.1 Interventions
S. Arlot is a member of the steering committee of a general-audience exhibition about artificial intelligence (“Entrez dans le monde de l'IA”), which is co-organized by Fermat Science (Toulouse), Institut Henri Poincaré (IHP, Paris) and Maison des Mathématiques et de l'Informatique (MMI, Lyon).
12 Scientific production
12.1 Major publications
 1 unpublished 'Estimating parameters of the Weibull Competing Risk model with Masked Causes and Heavily Censored Data'. October 2020, working paper or preprint
 2 inproceedings 'Fair Regression with Wasserstein Barycenters'. NeurIPS 2020 - 34th Conference on Neural Information Processing Systems Vancouver / Virtual, Canada December 2020
 3 unpublished 'Adaptation to the Range in $K$-Armed Bandits'. November 2020, working paper or preprint
 4 phdthesis 'Hold-out and Aggregated hold-out'. Université Paris-Saclay September 2020
 5 inproceedings 'Maintien en conditions opérationnelles d'une flotte de véhicules : estimation du besoin en pièce de rechange'. Econgrès 2020 Lambda λµ22 - 22e Congrès de Maîtrise des Risques et Sûreté de Fonctionnement λµ22 Le Havre / Virtual, France August 2020
12.2 Publications of the year
International journals
 6 article 'Consistency and Asymptotic Normality of Latent Blocks Model Estimators'. Electronic Journal of Statistics 14(1), March 2020, 1234-1268
 7 article 'Comparison of dengue case classification schemes and evaluation of biological changes in different dengue clinical patterns in a longitudinal follow-up of hospitalized children in Cambodia'. PLoS Neglected Tropical Diseases 14(9), 2020, e0008603
 8 article 'A quantitative Mc Diarmid's inequality for geometrically ergodic Markov chains'. Electronic Communications in Probability February 2020
 9 article 'Quantifying Spatiotemporal Parameters of Cellular Exocytosis in Micropatterned Cells'. Journal of Visualized Experiments : JoVE 163 2020
 10 article 'Robust machine learning by median-of-means: theory and practice'. Annals of Statistics May 2020
 11 article 'Cross-validation improved by aggregation: Agghoo'. Journal of Machine Learning Research 22(20), February 2021, 1-55
International peer-reviewed conferences
 12 inproceedings 'A binned technique for scalable model-based clustering on huge datasets'. MBC2 - Models and Learning for Clustering and Classification Journal ADAC - Advances in Data Analysis and Classification, Catania, Italy September 2020
 13 inproceedings 'Aggregation of Multiple Knockoffs'. ICML 2020 - 37th International Conference on Machine Learning, PMLR 119 Vienna / Virtual, Austria July 2020
National peer-reviewed conferences
 14 inproceedings 'Estimation of univariate Gaussian mixtures for huge raw datasets by using binned datasets'. JDS2020 Nice, France May 2020
 15 inproceedings 'Characterization of critical areas for mechanical part fatigue design'. Econgrès 2020 Lambda λµ22 - 22e Congrès de Maîtrise des Risques et Sûreté de Fonctionnement λµ22 Le Havre / Virtual, France August 2020
 16 inproceedings 'Characterization of critical areas for mechanical part fatigue design'. SFdS2020 - 52èmes Journées de Statistiques de la Société Française de Statistique Nice, France May 2020
 17 inproceedings 'Quels modèles pour le temps de stationnement des trains en Île de France ?' SFdS 2020 - 52èmes Journées de Statistiques de la Société Française de Statistique Nice, France May 2020
Conferences without proceedings
 18 inproceedings 'Fair Regression via Plug-in Estimator and Recalibration With Statistical Guarantees'. NeurIPS 2020 - 34th Conference on Neural Information Processing Systems Vancouver / Virtual, Canada December 2020
 19 inproceedings 'Fair Regression with Wasserstein Barycenters'. NeurIPS 2020 - 34th Conference on Neural Information Processing Systems Vancouver / Virtual, Canada December 2020
 20 inproceedings 'Maintien en conditions opérationnelles d'une flotte de véhicules : estimation du besoin en pièce de rechange'. Econgrès 2020 Lambda λµ22 - 22e Congrès de Maîtrise des Risques et Sûreté de Fonctionnement λµ22 Le Havre / Virtual, France August 2020
Doctoral dissertations and habilitation theses
 21 thesis 'Stochastic bandit algorithms for demand side management'. Université Paris-Saclay December 2020
 22 thesis 'Localization methods with applications to robust learning and interpolation'. Institut Polytechnique de Paris June 2020
 23 thesis 'Sequential learning and prediction: uniform regret bounds and hierarchical time series'. Université Paris-Saclay September 2020
 24 thesis 'Inference on random networks'. Faculté des sciences d'Orsay, Université Paris-Saclay November 2020
Reports & preprints
 25 misc 'A review of electric vehicle load open data and models'. November 2020
 26 misc 'Estimating parameters of the Weibull Competing Risk model with Masked Causes and Heavily Censored Data'. October 2020
 27 misc 'A minimax framework for quantifying risk-fairness trade-off in regression'. December 2020
 28 misc 'Finite Continuum-Armed Bandits'. November 2020
 29 misc 'Diversity-Preserving K-Armed Bandits, Revisited'. October 2020
 30 misc 'Adaptation to the Range in $K$-Armed Bandits'. November 2020
 31 misc 'Hierarchical robust aggregation of sales forecasts at aggregated levels in e-commerce, based on exponential smoothing and Holt's linear trend method'. June 2020
 32 misc 'Cluster or co-cluster the nodes of oriented graphs?' February 2021
 33 misc 'Aggregated hold-out for sparse linear regression with a robust loss function'. February 2020
 34 misc 'Online Orthogonal Matching Pursuit'. February 2021
Other scientific publications
 35 misc 'An example of prediction which complies with Demographic Parity and equalizes groupwise risks in the context of regression'. Vancouver, Canada December 2020
12.3 Cited publications
 36 unpublished 'Excess risk bounds in robust empirical risk minimization'. December 2019, working paper or preprint