Section: New Results
Model Selection in Regression and Classification
Participants: Jean-Patrick Baudry, Gilles Celeux, Mohammed El Anbari, Robin Genuer, Pascal Massart, Cathy Maugis, Bertrand Michel, Jean-Michel Poggi, Vincent Vandewalle, Nicolas Verzelen.
In collaboration with Marie-Laure Martin-Magniette (URGV and UMR AgroParisTech/INRA MIA 518), Gilles Celeux and Cathy Maugis developed a variable selection procedure for model-based clustering in [18]. The problem is regarded as a model selection problem in the model-based cluster analysis context. They proposed a more versatile variable selection model taking into account three possible roles for each variable: relevant clustering variables, irrelevant variables that depend on a subset of the relevant ones, and irrelevant variables that are independent of all the relevant variables. This modelling generalizes the models of [17] and of Raftery and Dean (2006). A model selection criterion based on BIC and a variable selection algorithm called SelvarClustIndep, embedding two backward stepwise variable selection algorithms for clustering and linear regression, are derived for this new variable role modelling. The identifiability of the models and the consistency of the variable selection criterion are also established. Numerical experiments highlight the interest of this new modelling. Two pieces of software, SelvarClust and SelvarClustIndep, implemented in C++, are devoted to variable selection in model-based clustering according to the modelling of [17] and [18] respectively. They are available at the following address: http://www.math.univ-toulouse.fr/~maugis . The authors are currently investigating how to take advantage of this general variable role modelling for discriminant analysis; in particular, variable selection renews the interest of quadratic discriminant analysis.
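As a rough illustration of the structure of such a criterion (a Python sketch under simplifying assumptions, not the SelvarClustIndep implementation: the function names are invented and the residual covariance of the regression part is taken diagonal), a candidate partition of the variables can be scored by summing a clustering BIC on the relevant variables and a regression BIC for the irrelevant variables explained by the relevant ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

def bic_clustering(X_S, K):
    """BIC (lower is better) of a K-component Gaussian mixture on the
    candidate relevant clustering variables."""
    return GaussianMixture(n_components=K, random_state=0).fit(X_S).bic(X_S)

def bic_regression(X_S, X_Sc):
    """BIC of a Gaussian linear regression of the irrelevant variables on the
    relevant ones (diagonal residual covariance: a simplifying assumption)."""
    n, q = X_Sc.shape
    resid = X_Sc - LinearRegression().fit(X_S, X_Sc).predict(X_S)
    sigma2 = resid.var(axis=0)                 # per-variable MLE variances
    loglik = -0.5 * n * np.sum(np.log(2 * np.pi * sigma2) + 1)
    n_params = q * (X_S.shape[1] + 1) + q      # coefficients + variances
    return -2 * loglik + n_params * np.log(n)

def score_partition(X, S, K):
    """Total criterion for clustering on the variables S and regressing the
    remaining variables on S; the best partition minimizes this score."""
    Sc = [j for j in range(X.shape[1]) if j not in S]
    crit = bic_clustering(X[:, S], K)
    if Sc:
        crit += bic_regression(X[:, S], X[:, Sc])
    return crit
```

In SelvarClustIndep, scores of this kind are explored with backward stepwise searches over the variable roles rather than by exhaustive enumeration of the partitions.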
These variable selection procedures are used in particular for genomics applications, the result of a collaboration with researchers of URGV (Evry Genopole).
Cathy Maugis and Bertrand Michel (GEOMETRICA, Inria) consider specific Gaussian mixtures to solve variable selection and clustering problems simultaneously. In [19], they proposed a non-asymptotic penalized criterion to choose the number of mixture components and the relevant variable subset. Because of the nonlinearity of the associated Kullback-Leibler contrast on Gaussian mixtures, a general model selection theorem for maximum likelihood estimators proposed by Massart is used to obtain the form of the penalty function and the associated oracle inequality. This theorem requires controlling the bracketing entropy of mixture families. These theoretical results nevertheless depend on unknown constants, and the authors are currently interested in establishing that their penalized maximum likelihood estimators are adaptive in the minimax sense. In [67], they study the practical use of their penalized criterion: a "slope heuristics" method is applied to calibrate the unknown constants. Jean-Patrick Baudry, Cathy Maugis and Bertrand Michel [49] are developing a Matlab package for the use of the slope heuristics of Birgé and Massart (dimension jump, slope estimation, ...). The aim is twofold: first, to propose solutions to the difficulties raised by its practical application; second, to provide a ready-to-use and easy solution for anyone wishing to apply the slope heuristics, and thereby to encourage its use. They are preparing an overview of the slope heuristics to introduce this package.
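A minimal sketch of the dimension jump calibration may help fix ideas (illustrative only: the package in preparation is in Matlab, and all names below are assumptions, written here in Python). Given, for each candidate model, its maximized log-likelihood, penalty shape and dimension, one selects a model for every value of the multiplicative constant and locates the sharpest jump of the selected dimension:

```python
import numpy as np

def dimension_jump(loglik, pen_shape, dims, kappas=None):
    """loglik, pen_shape, dims: 1-D arrays indexed by the candidate models.
    For each constant kappa, select the model minimizing the penalized
    criterion -loglik + kappa * pen_shape; kappa_min sits at the sharpest
    drop of the selected dimension, and the slope heuristics then takes
    kappa_opt = 2 * kappa_min."""
    if kappas is None:
        kappas = np.linspace(0.0, 10.0, 1001)
    selected = np.array([dims[np.argmin(-loglik + k * pen_shape)]
                         for k in kappas])
    jump = np.argmax(np.abs(np.diff(selected)))   # largest dimension jump
    kappa_min = kappas[jump + 1]
    return 2.0 * kappa_min

# The final model then minimizes -loglik + dimension_jump(...) * pen_shape.
```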
With Sylvain Arlot (ENS, CNRS), Pascal Massart [7] studied the so-called slope heuristics in the framework of regression on a random design, with possibly heteroscedastic noise. Assuming that all the models are made of histograms, they establish the same relationship between a "minimal" penalty and an optimal one. This can for instance be used for tuning a penalty when the optimal penalty is known up to a multiplicative constant. In general, the optimal shape of the penalty can be estimated by V-fold or resampling penalties. Their work is based on new structural concentration inequalities for the empirical risk and on the high-dimensional Wilks phenomenon highlighted in [9].
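Schematically, writing the selected model as the minimizer of the penalized empirical risk, their result states that the optimal penalty is about twice the minimal one (a heuristic summary; see [7] for the precise statement and conditions):

\[
\widehat{m} \in \operatorname*{arg\,min}_{m}\Big\{ \frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \widehat{s}_m(X_i)\big)^2 + \operatorname{pen}(m) \Big\},
\qquad
\operatorname{pen}_{\mathrm{opt}}(m) \;\approx\; 2\,\operatorname{pen}_{\min}(m),
\]

so that estimating the minimal penalty, below which the dimension of the selected model blows up, and doubling it yields a nearly optimal penalty.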
Jean-Patrick Baudry and Gilles Celeux [1] continued the study of estimation and model selection procedures derived by minimizing a new contrast adapted to clustering with mixture models and inspired by the ICL criterion. Theoretical results have been obtained on the consistency of the estimator thus defined and on the consistency of the corresponding model selection procedure. Moreover, solutions have been developed to compute this estimator in practice, which involves difficulties analogous to those arising when computing the usual maximum likelihood estimator for mixture models; those solutions may improve the results of the EM algorithm in this usual task, too. They also studied some robustness properties of the proposed estimator.
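For reference, the ICL criterion that inspired this contrast reads, in its commonly used form (with $\widehat{\tau}_{ik}$ the estimated conditional probability that observation $i$ arises from component $k$, and $\nu_K$ the number of free parameters of the $K$-component mixture),

\[
\mathrm{ICL}(K) \;=\; \log L(\widehat{\theta}_K) \;-\; \frac{\nu_K}{2}\,\log n \;+\; \sum_{i=1}^{n}\sum_{k=1}^{K} \widehat{\tau}_{ik}\,\log\widehat{\tau}_{ik},
\]

where the last term, minus the entropy of the conditional membership probabilities, penalizes mixtures with poorly separated components.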
Jean-Patrick Baudry and Gilles Celeux, in collaboration with Ana Maria Ferreira (Lisbon University), proposed a model selection criterion designed for situations where a solution well related to an external classification, available a priori, is sought. This criterion has been applied to a data set in the professional development field.
In collaboration with Professor Abdallah Mkhadri (University of Marrakesh, Morocco), Gilles Celeux supervised the thesis of Mohammed El Anbari, which concerns regularisation methods in linear regression. This year, in collaboration with Jean-Michel Marin (Université de Montpellier), they considered the Bayesian point of view and compared Bayesian methods of variable selection in linear regression with standard regularisation methods in a poorly informative context [53].
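As a generic illustration of the regularisation side of such a comparison (an invented toy setting, not the procedures compared in [53]), the lasso selects variables in a poorly informative design with few observations relative to the number of covariates:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 50, 20                          # few observations, many variables
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]            # only three active variables
y = X @ beta + rng.standard_normal(n)

lasso = LassoCV(cv=5).fit(X, y)        # penalty tuned by cross-validation
print("selected variables:", np.flatnonzero(lasso.coef_))
```

Bayesian variable selection instead puts prior mass on subsets of covariates and ranks them by posterior probability, which can behave quite differently when the data are poorly informative.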
Jean-Michel Poggi has supervised, since September 2007, the PhD thesis of Robin Genuer, dedicated to random forests and related algorithms for variable selection in regression and classification. Random forests, introduced by Leo Breiman in 2001, proceed by aggregating decision trees built under two random perturbations. The first perturbs the learning sample according to the bootstrap principle, and the second acts on the covariate space by randomly choosing a small number of explanatory variables to split each tree node. Surprisingly, this algorithm is extremely powerful for regression and classification problems, not only for prediction but also for variable selection purposes. The thesis is organized along three directions (a minimal code illustration follows the list):
- The preliminary, theoretical direction concerns the mathematical understanding of the reasons for this remarkable behaviour.
- The second, methodological direction aims at improving the knowledge of how to tune the parameters. It includes computer-intensive simulations and comparisons based on well-known real data sets.
- The last direction is of an applied nature and takes place within the joint working group between select and Neurospin (INRIA, CEA) dedicated to statistical methods for new fMRI data, in order to improve knowledge of brain activity. It aims to develop ad hoc variable selection strategies.
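The illustration announced above: a minimal scikit-learn sketch (not the code developed in the thesis) making the two random perturbations explicit, with the out-of-bag observations providing a built-in estimate of the prediction error:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for a real benchmark set
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,        # first perturbation: bootstrap the learning sample
    max_features="sqrt",   # second: a few covariates drawn at random per split
    oob_score=True,        # error estimated on the out-of-bag observations
    random_state=0,
).fit(X, y)
print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")
```

Here n_estimators (the number of trees) and max_features (Breiman's mtry, the number of covariates tried at each node) are precisely the parameters whose tuning the second direction of the thesis investigates.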
In [41], Robin Genuer and Vincent Michel present a new approach for the prediction of a behavioural variable from Functional Magnetic Resonance Imaging (fMRI) data. The difficulty stems from the huge number of image voxels that may carry relevant information, relative to the limited number of available images. Based on random forests, the approach provides an accurate, auto-calibrated framework for selecting a reduced set of jointly informative regions.
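An illustrative two-step reduction in the spirit of this approach (the actual procedure of [41] is auto-calibrated and more careful; the settings below are invented): rank the variables by random forest importance, then refit on a reduced top-ranked subset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Many candidate variables ("voxels"), few observations ("images")
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=10, random_state=0)

# Step 1: rank all variables by forest importance
ranker = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = np.argsort(ranker.feature_importances_)[::-1]

# Step 2: refit on the reduced set of top-ranked variables
kept = ranking[:50]
reduced = RandomForestClassifier(n_estimators=500, oob_score=True,
                                 random_state=0).fit(X[:, kept], y)
# Note: this naive OOB score is optimistic, since the same data drove the
# variable ranking; a proper evaluation would use held-out images.
print(f"OOB accuracy on the reduced set: {reduced.oob_score_:.3f}")
```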