The research domain for the select project is statistics. Statistical methodology has made great progress over the past few decades, with a variety of statistical learning software packages that support many different methods and algorithms. Users now face the problem of choosing among them, to select the most appropriate method for their data sets and objectives. The problem of model selection is an important but difficult problem, both theoretically and practically. Classical model selection criteria, which use penalized minimum-contrast criteria with fixed penalties, are often based on unrealistic assumptions.

select aims to provide efficient model selection criteria with data-driven penalty terms. In this context, select aims to improve the toolkit of statistical model selection criteria from both theoretical and practical perspectives. Currently, select is focusing its effort on variable selection in statistical learning, hidden-structure models and supervised classification. Its domains of application concern reliability, curve classification, phylogenetic analysis and classification in genetics. New developments in select activities are concerned with applications in biostatistics (statistical analysis of medical images) and biology.

From the applications we treat on a day-to-day basis, we have learned that some assumptions currently used in asymptotic theory for model selection are often irrelevant in practice. For instance, it is not realistic to assume that the target belongs to the family of models in competition. Moreover, in many situations it is useful to make the size of the model depend on the sample size, which makes asymptotic analyses break down. An important aim of select is to propose model selection criteria which take such practical constraints into account.

An important goal of select is to build and analyze penalized log-likelihood model selection criteria that are efficient when the number of models in competition grows to infinity with the number of observations. Concentration inequalities are a key tool for this, and lead to data-driven penalty choice strategies. A major research direction for select consists of deepening the analysis of data-driven penalties, both from the theoretical and practical points of view. There is no universal way of calibrating penalties, but there are several different general ideas that we aim to develop, including heuristics derived from Gaussian theory, special strategies for variable selection, and resampling methods.
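One of the Gaussian-theory heuristics mentioned above, the slope heuristic, can be sketched in a few lines: for sufficiently complex models the maximised log-likelihood grows roughly linearly with model dimension, the fitted slope estimates the minimal penalty, and twice this penalty is used for selection. The following sketch uses illustrative names and synthetic values, not a select implementation:

```python
def slope_heuristic(dims, loglik, fit_from=None):
    """Estimate the slope kappa of the (near-linear) large-model part of
    -loglik versus dimension, then select the model minimising the
    data-driven criterion -loglik + 2 * kappa * dim."""
    n = len(dims)
    if fit_from is None:
        fit_from = n // 2            # fit the slope on the largest models
    xs = dims[fit_from:]
    ys = [-l for l in loglik[fit_from:]]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    kappa = -slope                   # minimal penalty per dimension
    crit = [-l + 2 * kappa * d for l, d in zip(loglik, dims)]
    best = min(range(len(crit)), key=crit.__getitem__)
    return best, kappa
```

On synthetic log-likelihoods that saturate after the true dimension, the fitted slope recovers the per-dimension penalty and the criterion picks the model where the linear growth begins.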

Choosing a model is not only difficult theoretically. From a practical point of view, it is important to design model selection criteria that accommodate situations in which the data probability distribution P is unknown, and which take the model user's purpose into account. Most standard model selection criteria assume that P belongs to one of a set of models, without considering the purpose of the model. By also considering the model user's purpose, we can avoid or overcome certain theoretical difficulties, and produce flexible model selection criteria with data-driven penalties. The latter is useful in supervised classification and hidden-structure models.

The Bayesian approach to statistical problems is fundamentally probabilistic: a joint probability distribution is used to describe the relationships among all unknowns and the data. Inference is then based on the posterior distribution, i.e., the conditional probability distribution of the parameters given the observed data. Exploiting the internal consistency of the probability framework, the posterior distribution extracts relevant information in the data and provides a complete and coherent summary of post-data uncertainty. Using the posterior to solve specific inference and decision problems is then straightforward, at least in principle.
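As a toy illustration of how the posterior provides a complete post-data summary (and not one of the team's models), consider the conjugate Beta-Binomial case, where everything is available in closed form:

```python
def beta_binomial_posterior_mean(a, b, k, n):
    """Prior Beta(a, b) on a success probability, k successes observed in
    n Binomial trials: the posterior is Beta(a + k, b + n - k), and its
    mean coherently blends prior information with the data."""
    return (a + k) / (a + b + n)
```

With a uniform Beta(1, 1) prior and 7 successes in 10 trials, the posterior mean is (1 + 7) / (2 + 10) ≈ 0.667, a compromise between the prior mean 0.5 and the empirical frequency 0.7.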

A key goal of select is to produce methodological contributions in statistics. For this reason, the select team works with applications that serve as an important source of interesting practical problems and require innovative methodology to address them. Many of our applications involve contracts with industrial partners, e.g., in reliability, although we also have several academic collaborations, e.g., in genetics and image analysis.

Classification of complex data such as curves, functions, spectra and time series is an important problem in current research. Standard data analysis questions are being examined anew in order to define novel strategies that take the functional nature of such data into account. Functional data analysis addresses a variety of applied problems, including longitudinal studies, analysis of fMRI data, and spectral calibration.

We focus in particular on unsupervised classification. In addition to standard questions such as the choice of the number of clusters, the norm for measuring the distance between two observations, and the vectors for representing clusters, we must also address a major computational difficulty: the functional nature of the data, which requires new approaches.

For several years now, select has collaborated with the EDF-DER *Maintenance des Risques Industriels* group.
One important theme involves the resolution of inverse problems using simulation tools to
analyze uncertainty in highly complex physical systems.

The other major theme concerns reliability, through a research collaboration with Nexter involving a Cifre convention. This collaboration concerns a lifetime analysis of a vehicle fleet to assess ageing.

Moreover, a collaboration is ongoing with Dassault Aviation on the modal analysis of mechanical structures, which aims to identify the vibration behavior of structures under dynamic excitation. From the algorithmic point of view, modal analysis amounts to estimation in parametric models on the basis of measured excitations and structural response data. In the literature and in existing implementations, the model selection problem associated with this estimation is currently treated by a rather heavy and heuristic procedure. In the context of our own research, model selection via penalization is being tested on this problem.

For many years now, select has collaborated with Marie-Laure Martin-Magniette (URGV) on the analysis of genomic data. An important theme of this collaboration is using statistically sound model-based clustering methods to discover groups of co-expressed genes from microarray and high-throughput sequencing data. In particular, identifying biological entities that share similar profiles across several treatment conditions, such as co-expressed genes, may help identify groups of genes that are involved in the same biological processes.

Yann Vasseur has completed a thesis co-supervised by Gilles Celeux and Marie-Laure Martin-Magniette on this topic, which is also an interesting domain of investigation for the latent block model developed by select. In this work, Yann Vasseur dealt with high-dimensional, ill-posed problems where the number of variables was almost equal to the number of observations. He designed heuristic tools using regularized regression methods to circumvent this difficulty.

select collaborates with Anavaj Sakuntabhai and Philippe Dussart (Pasteur Institute) on predicting dengue severity using only low-dimensional clinical data obtained at hospital arrival. An ongoing project also involves statistical meta-analysis of newly collected dengue gene expression data along with recently published data sets from other groups. Further collaborations are underway in dengue fever and encephalitis with researchers at the Pasteur Institute.

select collaborates with Inserm/Paris-Saclay researchers at Kremlin-Bicêtre hospital on cyclic transcriptional clocks and renal corticosteroid signaling, developing statistical tests for synchronous signals.

select is involved in the ANR “jeunes chercheurs” MixStatSeq directed by Cathy Maugis (INSA Toulouse), which is concerned with statistical analysis and clustering of RNASeq genomics data.

A collaboration is ongoing with Pascale Tubert-Bitter, Ismael Ahmed and Mohamed Sedki (Pharmacoepidemiology and Infectious Diseases, PhEMI) on the analysis of pharmacovigilance data. In this framework, the goal is to detect, as early as possible, potential associations between certain drugs and adverse effects that appear after the drugs have been authorized for marketing. Instead of working on aggregate data (contingency tables), as is usually the case, the approach developed deals with individual data, which may provide more information. Valérie Robert has completed a thesis co-supervised by Gilles Celeux and Christine Keribin on this topic, which involved the development of a new model-based clustering method inspired by latent block models. Moreover, she has defined new tools to estimate and assess the block clustering involved in these models.

Ancient materials, encountered in archaeology and paleontology, are often complex, heterogeneous and poorly characterized before physico-chemical analysis. A popular technique to gather as much physico-chemical information as possible is spectro-microscopy, or spectral imaging, where a full spectrum, made of more than a thousand samples, is measured for each pixel. The resulting data are tensorial, with two or three spatial dimensions and one or more spectral dimensions, and require combining an “image” approach with a “curve analysis” approach. Since 2010, select has collaborated with Serge Cohen (IPANEMA) on the development of conditional density estimation through Gaussian mixture models and non-asymptotic model selection, to perform stochastic segmentation of such tensorial datasets. This technique makes it possible to account simultaneously for spatial and spectral information, while producing statistically sound information on the morphological and physico-chemical aspects of the studied samples.

*Block Clustering*

Keywords: Statistical analysis - Clustering package

Scientific Description: Simultaneous clustering of rows and columns, usually designated by biclustering, co-clustering or block clustering, is an important technique in two-way data analysis. It consists of estimating a mixture model which takes the block clustering problem on both the individual and variable sets into account. The blockcluster package provides a bridge between the C++ core library and the R statistical computing environment. The package can co-cluster binary, contingency, continuous and categorical data sets, and provides utility functions to visualize the results. It may be useful for applications in fields such as data mining, information retrieval, biology and computer vision.

Functional Description: BlockCluster is an R package for co-clustering of binary, contingency and continuous data based on mixture models.

Participants: Christophe Biernacki, Gilles Celeux, Parmeet Bhatia, Serge Iovleff, Vincent Brault and Vincent Kubicki

Partner: Université de Technologie de Compiègne

Contact: Serge Iovleff

URL: http://

*Massive Clustering with Cloud Computing*

Keywords: Statistical analysis - Big data - Machine learning - Web application

Scientific Description: The web application lets users use several software packages developed by Inria directly in a web browser. Mixmod is a classification library for continuous and categorical data. MixtComp allows for missing data and a larger choice of data types. BlockCluster is a library for co-clustering of data. When using the web application, the user can first upload a data set, then configure a job using one of the libraries mentioned and start the execution of the job on a cluster. The results are then displayed directly in the browser, allowing for rapid understanding and interactive visualisation.

Functional Description: The MASSICCC web application offers a simple and dynamic interface for analysing heterogeneous data with a web browser. Various software packages for statistical analysis are available (Mixmod, MixtComp, BlockCluster) which allow for supervised and unsupervised classification of large data sets.

Contact: Jonas Renault

*Many-purpose software for data mining and statistical learning*

Keywords: Data modeling - Mixed data - Classification - Data mining - Big data

Functional Description: Mixmod is a free toolbox for data mining and statistical learning designed for large and high-dimensional data sets. Mixmod provides reliable estimation algorithms and relevant model selection criteria.

It has been successfully applied in marketing, credit scoring, epidemiology, genomics and reliability, among other domains. Its distinguishing feature is a model-based approach that yields a wide range of methods for classification and clustering.

Mixmod makes it possible to assess the stability of the results with simple and thorough scores. It provides an easy-to-use graphical user interface (mixmodGUI) and functions for the R (Rmixmod) and Matlab (mixmodForMatlab) environments.

Participants: Benjamin Auder, Christophe Biernacki, Florent Langrognet, Gérard Govaert, Gilles Celeux, Remi Lebret and Serge Iovleff

Partners: CNRS - Université Lille 1 - LIFL - Laboratoire Paul Painlevé - HEUDIASYC - LMB

Contact: Gilles Celeux

The well-documented and consistent variable selection procedure for model-based cluster analysis and classification that Cathy Maugis (INSA Toulouse) designed during her PhD thesis in select makes use of stepwise algorithms, which are painfully slow in high dimensions. To circumvent this drawback, Gilles Celeux, in collaboration with Mohammed Sedki (Université Paris XI) and Cathy Maugis, has recently submitted an article in which variables are ranked using a lasso-like penalization adapted to the Gaussian mixture model context. Using this ranking to select variables, they avoid the combinatorial explosion of stepwise procedures. The performance on challenging simulated and real data sets is similar to that of the standard procedure, with CPU time divided by a factor of more than a hundred.

In collaboration with Jean-Michel Marin (Université de Montpellier) and Olivier Gascuel (LIRMM), Gilles Celeux has continued research
aiming to select a short list of models rather than a single model. This short list is declared to be compatible with the data using a

G. Maillard, S. Arlot and M. Lerasle studied a method mixing cross-validation with aggregation, called aggregated hold-out (Agghoo), which is already used by several practitioners. Agghoo can also be related to bagging. According to numerical experiments, Agghoo can significantly improve on the prediction error of cross-validation, at the same computational cost; this makes it very promising as a general-purpose tool for prediction. This work provides the first theoretical guarantees on Agghoo, in the supervised classification setting, ensuring that it can be used safely: at worst, Agghoo performs like hold-out, up to a constant factor. A non-asymptotic oracle inequality is also proved, in binary classification under the margin condition, which is sharp enough to obtain (fast) minimax rates.
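The procedure can be sketched as follows; this minimal Python version for binary classification, with majority-vote aggregation and illustrative names, is a simplification for exposition, not the authors' implementation:

```python
import random

def agghoo_predict(X, y, x_new, learners, n_splits=5, train_frac=0.8):
    """Aggregated hold-out (sketch): on each random split, select by
    hold-out error the best learner fitted on the training part, then
    aggregate the n_splits selected classifiers by majority vote.
    `learners` maps (X_train, y_train) -> prediction function."""
    n = len(X)
    votes = []
    for _ in range(n_splits):
        idx = list(range(n))
        random.shuffle(idx)
        cut = int(train_frac * n)
        tr, ho = idx[:cut], idx[cut:]
        Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
        best_f, best_err = None, float("inf")
        for learn in learners:
            f = learn(Xtr, ytr)
            err = sum(f(X[i]) != y[i] for i in ho) / len(ho)
            if err < best_err:          # hold-out selection on this split
                best_f, best_err = f, err
        votes.append(best_f(x_new))     # one vote per split
    return max(set(votes), key=votes.count)  # majority vote
```

Unlike plain hold-out, the final predictor aggregates one selected classifier per split, which is where the bagging-like stabilisation comes from.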

With G. Lecué, Matthieu Lerasle worked on “learning from MOM's principles”, showing that a recent procedure of Lugosi and Mendelson can be derived by applying Le Cam's “estimation from tests” procedure to MOM tests. They also showed robustness properties of these estimators, proving that the rates of convergence are not degraded even if some “outliers” have corrupted the dataset and the remaining data only have first and second moments equal to those of the target probability distribution.

Gilles Celeux and Serge Cohen have started research in collaboration with Agnès Grimaud (UVSQ) to perform clustering of hyperspectral images which respects spatial constraints. This is a one-class classification problem where distances between spectral images are given by the

Gilles Celeux continued his collaboration with Jean-Patrick Baudry on model-based clustering. This year, they started work on assessing model-based clustering methods on cytometry data sets. The interest of these data sets is that they require combining clustering and classification tasks in a unified framework.

Gilles Celeux and Julie Josse have started research on missing data in model-based clustering, in collaboration with Christophe Biernacki (Modal, Inria Lille). This year, they proposed a model for mixture analysis involving data that are not missing at random.

In the framework of MASSICCC, Benjamin Auder and Gilles Celeux have started research on the graphical representation of model-based clusters. The aim is to better display the proximity between clusters.

The long-standing open questions of the consistency and asymptotic normality of the maximum likelihood and variational estimators of the latent block model were finally tackled and resolved in a joint work with V. Brault and M. Mariadassou.

J.-M. Poggi (with R. Genuer, C. Tuleau-Malot and N. Villa-Vialaneix) published an article on random forests in “big data” classification problems, and reviewed the available proposals on random forests in parallel environments as well as online random forests. Three variants, involving subsampling, the Big Data bootstrap and MapReduce respectively, were tested on two massive datasets, one simulated and the other real-world data.

With G. Lecué, Matthieu Lerasle worked on robust machine learning by median-of-means, providing an alternative to the Lugosi and Mendelson approach based on median of means for learning. This alternative is easier to present and to analyse theoretically. Furthermore, they proposed an algorithm to approximate this estimator, which could not be done for Lugosi and Mendelson's champions of tournaments (submitted).
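The elementary median-of-means estimator underlying these procedures is simple to state; the following self-contained sketch (an illustration, not the authors' code) shows its robustness to outliers:

```python
import statistics

def median_of_means(sample, n_blocks):
    """Split the sample into n_blocks disjoint blocks, average each
    block, and return the median of the block means. A few gross
    outliers can corrupt at most a few blocks, so the median of the
    block means remains a reliable estimate of the true mean."""
    blocks = [sample[i::n_blocks] for i in range(n_blocks)]
    return statistics.median(statistics.fmean(b) for b in blocks)
```

On a sample of 99 values equal to 1 and one gross outlier at 1000, the empirical mean is 10.99 while the median-of-means with 5 blocks still returns 1.0.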

Jeanne Nguyen is working on estimation for conditional densities in high dimension. Much more informative than the regression function, conditional densities are of high interest in recent methods, particularly in the Bayesian framework (studying the posterior distribution). Considering a specific family of kernel estimators, she is studying a greedy algorithm for selecting the bandwidth. Her method addresses several issues: avoiding the curse of high dimensionality under some suitably defined sparsity conditions, being computationally efficient using iterative procedures, and early variable selection, providing theoretical guarantees on the minimax risk.

Since June 2015, in the framework of a CIFRE convention with Nexter, Florence Ducros has been working on a thesis on modeling the aging of vehicles, supervised by Gilles Celeux and Patrick Pamphile. This thesis should lead to the design of an efficient maintenance strategy according to vehicle use profiles. Moreover, warranty cost calculations are made in the context of heterogeneous usage. This requires estimating mixtures and competing-risk models in a highly censored setting.

This year, Patrick Pamphile and Florence Ducros published an article proposing a two-component Weibull mixture model for modeling unobserved heterogeneity in heavily censored lifetime data. The performance of classical estimation methods (maximum likelihood, EM, full Bayes and MCMC) is poor due to the large number of parameters and the heavy censoring. Thus, a Bayesian bootstrap method called Bayesian Restoration Maximization (BRM) was used, with sampling from the posterior distribution obtained through an importance sampling technique. Simulation results showed that, even with heavy censoring, BRM is effective both in terms of estimation precision and computation time.

The subject of Yann Vasseur's PhD Thesis, supervised by Gilles Celeux and Marie-Laure Martin-Magniette (INRA URGV), was the inference of a regulatory network for
Transcription Factors (TFs), which are specific genes, of *Arabidopsis thaliana*. For this, a transcriptome dataset with a similar number of TFs
and statistical units was available. They reduced the dimension of the network to avoid high-dimensional difficulties. Representing this network
with a Gaussian graphical model, the following procedure was defined:

*Selection step*: choose the set of TF regulators (supports) of each TF.

*Classification step*: deduce co-factor groups (TFs with similar expression levels) from these supports.

Thus, the reduced network would be built on the co-factor groups. Currently, several selection methods based on Gauss-LASSO and resampling procedures have been applied to the dataset. The study of the stability and parameter calibration of these methods is in progress. The TFs are clustered with the latent block model into a number of co-factor groups, selected with BIC or the exact ICL criterion. Since these models are built in an ad hoc way, Yann Vasseur has defined complex simulation tools to assess their performance properly.

In collaboration with Benno Schwikowski, Iryna Nikolayeva and Anavaj Sakuntabhai (Pasteur Institute, Paris), Kevin Bleakley worked on using 2-d isotonic regression to predict dengue fever severity at hospital arrival using high-dimensional microarray gene expression data. Important marker genes for dengue severity have been detected, some of which now have been validated in external lab trials, and an article has now been submitted.
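For illustration, the one-dimensional core of this machinery is the pool-adjacent-violators algorithm (PAVA), which computes the least-squares non-decreasing fit to a sequence; the 2-d regression used in the study extends this to a partial order over two covariates. A minimal sketch (illustrative, not the study's code):

```python
def pava(y):
    """Pool Adjacent Violators Algorithm: least-squares non-decreasing
    (isotonic) fit to the sequence y. Each block stores its current
    mean and weight; violating adjacent blocks are merged by weighted
    averaging until the block means are non-decreasing."""
    blocks = []                       # list of [mean, weight]
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()     # merge the violating pair
            m1, w1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w])
    out = []
    for m, w in blocks:               # expand blocks back to a sequence
        out.extend([m] * w)
    return out
```

For example, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into two values of 2.5.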

In collaboration with researchers from the Pasteur Institute, Kevin Bleakley worked on statistical tests in the context of research into what leads to dengue fever *without symptoms* as opposed to *with* symptoms. This work was published in *Science Translational Medicine*.

Kevin Bleakley has also collaborated with Inserm/Paris-Saclay researchers at Kremlin-Bicêtre hospital on cyclic transcriptional clocks and renal corticosteroid signaling, and has developed novel statistical tests for detecting synchronous signals. This work is submitted.

In collaboration with Pascale Tubert-Bitter, Ismael Ahmed and Mohamed Sedki, Gilles Celeux and Christine Keribin worked on the detection of associations between drugs and adverse events in the framework of the PhD of Valérie Robert, defended this year. First, the team developed a model-based clustering method inspired by latent block models (LBMs), which consists of co-clustering the rows and columns of two binary tables while imposing the same row ranking. This makes it possible to highlight subgroups of individuals sharing the same drug profile, and subgroups of adverse effects and drugs with strong interactions. Furthermore, sufficient conditions for identifiability of the model were provided, and results were shown on simulated data. The exact ICL criterion was extended to this double latent block model. Through computer experiments, Valérie Robert demonstrated the interest of the proposed model, compared with standard contingency-table analysis, in detecting co-prescription and masking effects.

Furthermore, with V. Robert, C. Keribin and G. Celeux showed that it can be useful to fit an LBM to a contingency table of drugs and adverse effects in order to initialize the clustering of individual data.

In collaboration with Jean-Louis Foulley (Montpellier University), Gilles Celeux and Julie Josse have done research on the statistical rating and ranking of scientific journals. They have proposed Dirichlet multinomial Bayesian models for pagerank-type algorithms allowing self-citations to be excluded. The resulting methods were tested on a set of 47 scientific journals.

In collaboration with R. Diel, Matthieu Lerasle published an article on nonparametric estimation for random walks in random environments. They proposed a non-parametric approach for estimating the distribution of the environment from the observation of one trajectory of a random walk in it. They obtained risk bounds in sup-norm for the cumulative distribution function of the environment.

In collaboration with R. Chetrite and R. Diel, Matthieu Lerasle published an article on the number of potential winners in the Bradley-Terry model in random environments. They proposed the first mathematical study of the Bradley-Terry model in which the values of players are i.i.d. realisations of some distribution. They proved that a Bradley-Terry tournament is fair (in the sense that the best player ends up with the largest number of victories) under a certain convexity condition on the tail distribution of the values. They also showed that this condition is sharp, and provided sharp estimates of the number of potential winners when the condition fails.

He also collaborated with R. Diel and S. Le Corff on learning latent structures of large random graphs, investigating the possibility of estimating latent structure in sparsely observed random graphs. The main example was a Bradley-Terry tournament in which each team has only played a few games. It is well known that the individual values of the teams cannot be consistently estimated in this setting. They showed that the distribution of these values can be, however, and provided general tools for bounding the risk of the maximum likelihood estimator (submitted).

select has a contract with Nexter regarding modeling the reliability of vehicles.

Benjamin Auder and Jean-Michel Poggi are participants in the
grant PGMO-IRSDI, in the *Research Initiative In Industrial Data Science* context, on the subject: Disaggregated Electricity Forecasting using Clustering of Individual Consumers.

Gilles Celeux and Christine Keribin have a collaboration with the Pharmacoepidemiology and Infectious Diseases (PhEMI, INSERM) groups.

Sylvain Arlot and Pascal Massart co-organize a working group at ENS (Ulm) on statistical learning.

select is part of the ANR funded MixStatSeq.

Gilles Celeux is one of the co-organizers of the international working group on model-based clustering. This year the workshop took place in Perugia, Italy.

Kevin Bleakley stayed at the Pasteur Institute, Cambodia, while working on several collaborations in dengue fever research, from late 2016 until early 2017.

Sylvain Arlot organized (with Guillaume Charpiat) the Workshop Statistics/Learning at Paris-Saclay (2nd edition), at IHES (Bures-sur-Yvette).


Sylvain Arlot is one of the co-organizers of the Junior Conference on Data Science and Engineering at Paris-Saclay (2nd edition in 2017).

Jean-Michel Poggi was president of the Scientific Program Committee, ENBIS 2017, Naples, 10-14 June 2017.

Jean-Michel Poggi was a member of the Conference Scientific Board of IES 2017, Naples, Italy, 6-8 September 2017.

Gilles Celeux is Editor-in-Chief of the *Journal de la SFdS*.
He is Associate Editor of *Statistics and Computing*
and *CSBIGS*.

Pascal Massart is Associate Editor of *Annals of Statistics*,
*Confluentes Mathematici*, and *Foundations and Trends in Machine Learning*.

Jean-Michel Poggi is Associate Editor of *Journal of Statistical Software*, *Journal de la SFdS* and *CSBIGS*.

The members of the team have reviewed numerous papers for numerous international journals.

The members of the team have given many invited talks on their research in the course of 2016.

Jean-Michel Poggi is:

Vice-President ENBIS (European Network for Business and Industrial Statistics), 2015-18

Vice-President FENStatS (Federation of European National Statistical Societies) since 2012

Council Member of the ISI (2015-19)

Member of the Board of Directors of the ERS of IASC (since 2014)

Jean-Michel Poggi is member of the EMS Committee for Applied Mathematics (since 2014).

Jean-Michel Poggi is the president of ECAS (European Courses in Advanced Statistics) since 2015.

Sylvain Arlot coordinates (jointly with Marc Schoenauer, Inria Saclay) the math-STIC program of the Labex Mathématique Hadamard.

Christine Keribin is treasurer of the Société Française de Statistique (SFdS).

select members teach various courses at several different universities, and in particular the Master 2 “Mathématique de l'aléatoire” of Université Paris-Saclay.

PhD: Valérie Robert, 2013, Gilles Celeux and Christine Keribin. Defended in June 2017

PhD: Yann Vasseur, 2013, Gilles Celeux and Marie-Laure Martin-Magniette (URGV). Defended in December 2017

PhD in progress: Neska El Haouij, 2014, Jean-Michel Poggi and Meriem Jaïdane, Raja Ghozi (ENIT Tunisie) and Sylvie Sevestre-Ghalila (CEA LinkLab), Thesis ENITUPS

PhD in progress: Florence Ducros, 2015, Gilles Celeux and Patrick Pamphile

PhD in progress: Claire Brécheteau, 2015, Pascal Massart

PhD in progress: Hedi Hadiji, 2017, Pascal Massart

PhD in progress: Eddie Aamari, 2015, Pascal Massart and Frédéric Chazal

PhD: Damien Garreau, 2013, Sylvain Arlot and Gérard Biau (UPMC). Defended in October 2017

PhD in progress: Guillaume Maillard, 2016, Sylvain Arlot and Matthieu Lerasle

PhD in progress: Jeanne Nguyen, 2015, Claire Lacour and Vincent Rivoirard (Univ Paris Dauphine)

PhD in progress: Benjamin Goehry, 2015, Pascal Massart and Jean-Michel Poggi

Masters internship: Thomas Prochwicz, supervised by Christine Keribin, conducted a preliminary study on expert aggregation during this three-month internship.

S. Arlot was a member of the Ph.D. jury of Jilai Mei (Université Paris-Sud).