Section: New Results
Model selection in Regression and Classification
Participants: Jean-Patrick Baudry, Gilles Celeux, Mohammed El Anbari, Robin Genuer, Pascal Massart, Cathy Maugis, Bertrand Michel, Jean-Michel Poggi, Vincent Vandewalle, Nicolas Verzelen.
In collaboration with Marie-Laure Martin-Magniette (URGV and UMR AgroParisTech/INRA MIA 518), Gilles Celeux and Cathy Maugis developed a variable selection procedure for model-based clustering in [18]. The problem is regarded as a model selection problem in the model-based cluster analysis context. They proposed a more versatile variable selection model taking into account three possible roles for each variable: relevant clustering variables, irrelevant clustering variables dependent on a subset of the relevant clustering variables, and irrelevant clustering variables totally independent of all the relevant variables. This modelling generalizes the models of [17] and of Raftery and Dean (2006). A model selection criterion based on BIC and a variable selection algorithm called SelvarClustIndep, embedding two backward stepwise variable selection algorithms for clustering and linear regression, are derived for this new variable role modelling. The model identifiability and the consistency of the variable selection criterion are also established. Numerical experiments highlight the interest of this new modelling. Two software packages, SelvarClust and SelvarClustIndep, implemented in C++, are devoted to variable selection in model-based clustering according to the modelling of [17] and [18] respectively. They are available at the following address: http://www.math.univ-toulouse.fr/~maugis. Currently, those researchers are interested in taking advantage of this general variable role modelling for discriminant analysis. In particular, variable selection renews interest in quadratic discriminant analysis.
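As a toy illustration of the BIC-based comparison underlying this variable role modelling (a minimal sketch in Python using scikit-learn, not the SelvarClustIndep algorithm itself; the data and the two candidate models are invented for the example), one can compare the BIC of a model declaring a noise variable relevant for clustering with that of a model treating it as independent of the clustering variables:

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two clusters carried by x1; x2 is pure noise, independent of everything
x1 = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
x2 = rng.normal(0, 1, 400)
X = np.column_stack([x1, x2])

def bic_mixture(X, k):
    # sklearn's bic() returns -2 log L + p log n: lower is better
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    return gm.bic(X)

def bic_gaussian(x):
    # BIC of a single Gaussian fit (2 free parameters: mean and variance)
    n = len(x)
    ll = norm.logpdf(x, x.mean(), x.std()).sum()
    return -2 * ll + 2 * np.log(n)

# Model A: both variables declared relevant for clustering
bic_a = bic_mixture(X, 2)
# Model B: x1 is the clustering variable, x2 independent of it
bic_b = bic_mixture(x1.reshape(-1, 1), 2) + bic_gaussian(x2)
print(bic_a, bic_b)  # model B, more parsimonious, should win
```

Here the more parsimonious model B, which treats x2 as independent noise, should obtain the lower (better) total BIC, illustrating how the criterion can assign a variable the "independent irrelevant" role.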
These variable selection procedures are used in particular for genomics applications, resulting from a collaboration with researchers of URGV (Evry Genopole).
Cathy Maugis, with Bertrand Michel (GEOMETRICA, Inria), considers specific Gaussian mixtures to solve variable selection and clustering problems simultaneously. In [19], they proposed a non-asymptotic penalized criterion to choose the number of mixture components and the relevant variable subset. Because of the non-linearity of the associated Kullback-Leibler contrast on Gaussian mixtures, a general model selection theorem for maximum likelihood estimation proposed by Massart is used to obtain the form of the penalty function and the associated oracle inequality. This theorem requires controlling the bracketing entropy of mixture families. Nevertheless, these theoretical results depend on unknown constants. Currently, they are interested in establishing an adaptivity property of their penalized maximum likelihood estimators in a minimax sense. In [67], they study the practical use of their penalized criterion: a "slope heuristics" method is applied to calibrate these constants. Jean-Patrick Baudry, Cathy Maugis and Bertrand Michel [49] are developing a Matlab package implementing the slope heuristics of Birgé and Massart (dimension jump, slope estimation, ...). The aim is twofold: first, to propose solutions to the practical difficulties involved in applying the heuristics; second, to provide a ready-to-use and easy solution for people who may want to apply the slope heuristics, and thus to encourage its use. They are preparing an overview of the slope heuristics to introduce this package.
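The slope estimation variant of the slope heuristics can be sketched as follows (a hypothetical toy in Python with synthetic contrast values, not the Matlab package of [49]; the slope 0.5 and the shape of the contrast are invented): fit the linear behaviour of the empirical contrast against model dimension on the largest models, read off the minimal penalty per dimension, and penalize with twice that amount:

```python
import numpy as np

# toy contrast values: the minimized -log-likelihood decreases roughly
# linearly in the model dimension D for large D (slope -0.5 assumed here)
rng = np.random.default_rng(1)
dims = np.arange(1, 51)
neg_loglik = 100 - 0.5 * dims + rng.normal(0, 0.3, dims.size)
neg_loglik[:5] += np.array([20.0, 12.0, 6.0, 3.0, 1.0])  # real structure: small models fit badly

# slope heuristics: regress the contrast on D over the largest models,
# where only the "minimal penalty" linear trend remains
mask = dims >= 25
slope = np.polyfit(dims[mask], neg_loglik[mask], 1)[0]  # close to -0.5
kappa = -slope                         # estimated minimal penalty per dimension
penalized = neg_loglik + 2 * kappa * dims  # optimal penalty = twice the minimal one
best_dim = dims[np.argmin(penalized)]
print(best_dim)
```

The selected dimension sits where the genuine drop in the contrast levels off, instead of growing without bound as an unpenalized criterion would suggest.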
With Sylvain Arlot (ENS, CNRS), Pascal Massart [7] studied the so-called slope heuristics in the framework of regression on a random design, with possibly heteroscedastic noise. Assuming that all the models are made of histograms, they show the same relationship between a “minimal penalty” and an optimal one. This can for instance be used for tuning a penalty when the optimal penalty is known up to some multiplicative constant. In general, the optimal shape of the penalty can be estimated by V-fold or resampling penalties. Their work is based on new structural concentration inequalities for the empirical risk and the high-dimensional Wilks phenomenon brought to light in [9].
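As a rough companion sketch, here is plain V-fold cross-validation for selecting the partition size of a histogram (regressogram) estimator under heteroscedastic noise — a close cousin of, though not identical to, the V-fold penalties studied in [7]; the data-generating process and the bin grid are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 1, n)
# regression on a random design with heteroscedastic noise
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3 * (1 + x), n)

def histogram_predict(x_train, y_train, x_test, bins):
    """Regressogram: piecewise-constant fit on a regular partition of [0, 1]."""
    edges = np.linspace(0, 1, bins + 1)
    idx_tr = np.clip(np.digitize(x_train, edges) - 1, 0, bins - 1)
    idx_te = np.clip(np.digitize(x_test, edges) - 1, 0, bins - 1)
    means = np.zeros(bins)
    for b in range(bins):
        in_b = idx_tr == b
        means[b] = y_train[in_b].mean() if in_b.any() else y_train.mean()
    return means[idx_te]

def vfold_risk(x, y, bins, V=5):
    # average held-out squared error over V folds
    folds = np.arange(len(x)) % V
    risk = 0.0
    for v in range(V):
        te = folds == v
        pred = histogram_predict(x[~te], y[~te], x[te], bins)
        risk += ((y[te] - pred) ** 2).mean()
    return risk / V

best_bins = min(range(1, 41), key=lambda b: vfold_risk(x, y, b))
print(best_bins)
```

The cross-validated risk automatically balances approximation error (too few bins) against variance (too many bins), without requiring the noise level to be known or homoscedastic.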
Jean-Patrick Baudry and Gilles Celeux [1] continued the study of estimation and model selection procedures derived by minimizing a new contrast adapted for clustering with mixture models and inspired by the ICL criterion. Theoretical results have been obtained about the consistency of the estimator thus defined and about the consistency of the corresponding model selection procedure. Moreover, solutions have been developed to compute this estimator in practice, which involves difficulties analogous to those arising when computing the usual maximum likelihood estimator with mixture models. Those solutions may also improve the results of the EM algorithm in this usual task. They also studied some robustness properties of the proposed estimator.
Jean-Patrick Baudry and Gilles Celeux, in collaboration with Ana Maria Ferreira (Lisbon University), proposed a model selection criterion which can be helpful when one wishes to find a solution well related to an external classification available a priori. This criterion has been applied to a data set in the professional development field.
In collaboration with Professor Abdallah Mkhadri (University of Marrakesh, Morocco), Gilles Celeux supervised the thesis of Mohammed El Anbari, which concerns regularisation methods in linear regression. This year, in collaboration with Jean-Michel Marin (Université de Montpellier), they considered the Bayesian point of view and compared Bayesian methods of variable selection in linear regression with standard regularisation methods in a poorly informative context [53].
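On the regularisation side of such a comparison, a standard sparse method like the Lasso can be run in a few lines (a generic scikit-learn sketch on simulated data where p is close to n, not the actual procedures compared in [53]; the design, coefficients and noise level are invented):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 50, 30                      # poorly informative setting: p close to n
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]        # only three active covariates
y = X @ beta + rng.normal(0, 0.5, n)

# Lasso with the regularisation parameter tuned by 5-fold cross-validation
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
print(selected)  # the three active covariates should be among those selected
```

The shrinkage sets most coefficients exactly to zero, so the support of the fitted coefficient vector directly yields a selected variable subset, the baseline against which Bayesian selection procedures can be compared.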
Jean-Michel Poggi has been the supervisor of the PhD thesis of Robin Genuer since September 2007, dedicated to Random Forests and related algorithms for variable selection in regression or classification. Random Forests, due to Leo Breiman in 2001, proceed by aggregating decision trees grown under two random perturbations. The first one perturbs the learning sample according to the bootstrap principle, and the second one acts on the covariate space by randomly choosing a small number of explanatory variables to split a tree node. Surprisingly, this algorithm is extremely powerful for regression and classification problems, not only for prediction but also for variable selection purposes. The PhD thesis is articulated along three directions:

The preliminary, theoretical direction concerns the mathematical understanding of the reasons for this remarkable behaviour.

The second, methodological direction aims at improving knowledge about how to tune the parameters. It includes computer-intensive simulations and comparisons based on well-known real data sets.

The last one is of an applied nature and takes place within the joint working group between select and Neurospin (INRIA, CEA), dedicated to statistical methods for new fMRI data in order to improve knowledge about brain activity. It aims to develop ad hoc variable selection strategies.
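The two random perturbations described above (bootstrap resampling of the learning sample and random covariate subsets at each split) and the resulting variable importance ranking can be sketched with scikit-learn (a generic illustration on simulated data, not the procedures developed in the thesis; the data and parameter values are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 10))
# the class depends on the first two variables only
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# bootstrap=True resamples the learning sample for each tree;
# max_features="sqrt" draws a small random covariate subset at each split
rf = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", bootstrap=True, random_state=0
).fit(X, y)

# rank variables by decreasing importance
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:2])  # the informative variables should come first
```

Importance-based rankings of this kind are the usual starting point for Random Forest variable selection strategies: keep the top-ranked variables, then refit and validate on the reduced set.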
In [41], Robin Genuer and Vincent Michel present a new approach for the prediction of a behavioral variable from functional Magnetic Resonance Imaging (fMRI) data. The difficulty comes from the huge number of image voxels that may provide relevant information, relative to the limited number of available images. Based on Random Forests, the approach provides an accurate, auto-calibrated framework for selecting a reduced set of jointly informative regions.