## Section: New Results

### Statistical learning methodology and theory

Participants: Gilles Celeux, Pascal Massart, Vincent Vandewalle, Jean-Michel Poggi.

Gilles Celeux and Jean-Patrick Baudry, in collaboration with Adrian Raftery, Kenneth Lo and Raphael Gottardo [64], proposed a clustering methodology that aims to take advantage of mixture models both for their usefulness in clustering and for their good approximation properties, by modeling each class itself as a mixture. The question to be answered is then which mixture components should be gathered into a class. They proposed a criterion based on an entropy measure of the quality of the resulting classification. The resulting hierarchical methodology is notably illustrated by its application to a flow cytometry data set.
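The merging step can be sketched as follows: given the matrix of posterior component probabilities of a fitted mixture, one greedily merges the pair of components whose union most reduces the entropy of the soft classification. This is a minimal illustration of the entropy idea, not the authors' implementation; the function names and the greedy pair search are our own.

```python
import numpy as np

def classification_entropy(tau):
    """Entropy of the soft classification given posterior probabilities tau (n x K)."""
    t = np.clip(tau, 1e-12, 1.0)
    return -np.sum(t * np.log(t))

def best_merge(tau):
    """Find the pair of components whose merging most reduces the entropy.

    Returns (entropy gain, j, k, merged posterior matrix)."""
    n, K = tau.shape
    base = classification_entropy(tau)
    best = None
    for j in range(K):
        for k in range(j + 1, K):
            # merging components j and k just adds their posterior probabilities
            merged = np.delete(tau, k, axis=1)
            merged[:, j] = tau[:, j] + tau[:, k]
            gain = base - classification_entropy(merged)
            if best is None or gain > best[0]:
                best = (gain, j, k, merged)
    return best
```

Applying `best_merge` repeatedly, down to a single component, yields a merging hierarchy of the kind described above.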

In collaboration with Christophe Biernacki (Université de Lille) and Gérard Govaert (UTC Compiègne),
Gilles Celeux proposed a non-asymptotic version of integrated likelihoods for the latent class model, or
multivariate multinomial mixture model.
They exploited the fact that a fully Bayesian
analysis with Jeffreys non-informative prior distributions raises no technical
difficulty, to derive an exact expression of the integrated *complete-data* likelihood,
which is known to be a meaningful model selection criterion in a clustering
perspective. Moreover, this year they proposed importance sampling strategies taking into account the so-called label
switching problem to get efficient approximations of the integrated *observed-data*
likelihood.
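For the latent class model, such an exact expression follows from conjugate Dirichlet–multinomial integrals. The sketch below (our own function names; Jeffreys prior taken as a symmetric Dirichlet with parameter 1/2) shows how an integrated complete-data log-likelihood of this kind can be assembled from per-cluster and per-variable Dirichlet marginals under the model's conditional-independence assumption; it illustrates the idea, not the authors' exact formula.

```python
from math import lgamma
import numpy as np

def log_dirichlet_multinomial(counts, alpha):
    """Log marginal of a sequence of categorical draws with the given counts,
    under a symmetric Dirichlet(alpha) prior on the category probabilities."""
    counts = np.asarray(counts, dtype=float)
    m = counts.size
    return (lgamma(m * alpha) - m * lgamma(alpha)
            + sum(lgamma(c + alpha) for c in counts)
            - lgamma(counts.sum() + m * alpha))

def log_icl_latent_class(x, z, K, alpha=0.5):
    """Integrated complete-data log-likelihood sketch for the latent class model.

    x: n x d array of categorical codes, z: cluster labels in {0, ..., K-1},
    alpha = 0.5 corresponds to Jeffreys priors on all multinomial parameters."""
    x, z = np.asarray(x), np.asarray(z)
    n, d = x.shape
    # term for the mixing proportions
    nk = np.array([(z == k).sum() for k in range(K)])
    total = log_dirichlet_multinomial(nk, alpha)
    # one conjugate term per cluster and per categorical variable
    for j in range(d):
        levels = np.unique(x[:, j])
        for k in range(K):
            counts = [((z == k) & (x[:, j] == h)).sum() for h in levels]
            total += log_dirichlet_multinomial(counts, alpha)
    return total
```

Because everything integrates in closed form, no asymptotic (BIC-like) approximation is needed.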

Vincent Vandewalle will defend his PhD thesis in December 2009 on semi-supervised model-based classification, under the supervision of Christophe Biernacki (Université de Lille), Gilles Celeux and Gérard Govaert (UTC). His thesis focused on the discriminant analysis situation, which is of main interest for applications. Firstly, he designed a hypothesis test to take advantage of unlabeled data in deciding whether a classification model is reliable. Then, he conceived and investigated specific information-based criteria for model selection in the semi-supervised setting. Conceived in the same spirit as the BEC criterion of Bouchard and Celeux (2006, IEEE PAMI), taking the classification purpose into account, his AIC-like criterion behaves slightly better in practice and has better theoretical features. In the supervised setting, he proposed an alternative AIC-like criterion which penalises the conditional data likelihood by the number of independent parameters involved in the conditional likelihood when a generative model is learned [45]. This criterion has shown promising behavior and is computationally cheap.
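To make the supervised-setting idea concrete, the sketch below (our own names, with a shared-covariance Gaussian generative model as a stand-in) penalises the conditional log-likelihood log p(y | x) by the number of free parameters of the induced conditional model, here the d weights and one intercept of the linear discriminant. It illustrates the principle behind the criterion in [45], not its exact form.

```python
import numpy as np

def gaussian_lda_conditional_loglik(X, y, mu0, mu1, Sigma, pi1):
    """Conditional log-likelihood sum_i log p(y_i | x_i) for a two-class,
    shared-covariance Gaussian generative model (linear discriminant)."""
    inv = np.linalg.inv(Sigma)
    # the induced decision function is linear: log-odds(x) = w . x + b
    w = inv @ (mu1 - mu0)
    b = -0.5 * (mu1 @ inv @ mu1 - mu0 @ inv @ mu0) + np.log(pi1 / (1 - pi1))
    s = X @ w + b
    # log p(y | x) through the logistic link
    return np.sum(np.where(y == 1, -np.log1p(np.exp(-s)), -np.log1p(np.exp(s))))

def conditional_aic(X, y, mu0, mu1, Sigma, pi1):
    """AIC-like score: conditional log-likelihood minus the number of free
    parameters of the conditional model (d weights + 1 intercept), not the
    larger parameter count of the full generative model."""
    d = X.shape[1]
    return gaussian_lda_conditional_loglik(X, y, mu0, mu1, Sigma, pi1) - (d + 1)
```

The point of the penalty choice is that the generative model has many more parameters than the conditional decision rule it induces, and only the latter matter for classification.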

Jean-Michel Poggi proposed a procedure for detecting outliers in regression problems, based on information provided by boosting regression trees. The key idea is to select the most frequently resampled observation along the boosting iterations and to iterate after removing it. The selection criterion is based on Tchebychev's inequality applied to the maximum, over the boosting iterations, of the average number of appearances in bootstrap samples; the procedure is thus noise-distribution free. Many well-known benchmark data sets are considered, and a comparative study against two well-known competitors shows the interest of the method.
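One possible reading of the selection rule can be sketched as follows (function name and thresholding details are our own, and plain resampling counts stand in for the boosting machinery): the observation with the largest average number of appearances is flagged when it exceeds a Tchebychev bound built from the remaining observations, which keeps the rule distribution-free.

```python
import numpy as np

def flag_most_resampled(counts, alpha=0.05):
    """counts: (B, n) array, appearances of each of n observations in each of
    B boosting/bootstrap resamples.  Selects the observation with the largest
    average appearance count and flags it as an outlier when that average
    exceeds a Tchebychev bound at level alpha estimated from the other
    observations.  Returns the flagged index, or None."""
    avg = counts.mean(axis=0)              # average appearances per observation
    i = int(np.argmax(avg))
    rest = np.delete(avg, i)
    mu, sd = rest.mean(), rest.std(ddof=1)
    t = 1.0 / np.sqrt(alpha)               # Tchebychev: P(|X - mu| >= t*sd) <= 1/t**2
    return i if avg[i] > mu + t * sd else None
```

Removing the flagged observation and rerunning the boosting then yields the iterative procedure described above.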