## Project-Team: modbio

## Section: Scientific Foundations

## Statistical learning

Statistical learning theory [48] is a branch of inferential statistics whose foundations were laid by V.N. Vapnik in the late 1960s. Its goal is to specify the conditions under which it is possible to "learn" from empirical data obtained by random sampling. Learning amounts to solving a model selection problem. More precisely, given a problem characterized by a joint probability distribution on pairs made up of an observation and a label, and a set of functions, usually of infinite cardinality, the goal is to find in this set a function with optimal performance. Problems fall into one of three areas: pattern recognition (discriminant analysis), function approximation (regression) and density estimation.

This theory focuses on two inductive principles. The first, the empirical risk minimization (ERM) principle, consists in minimizing the training error. When the sample is small, it is replaced by the structural risk minimization (SRM) principle, which consists in minimizing an upper bound on the expected risk (generalization error), a bound sometimes called a guaranteed risk. The latter principle is implemented in the training algorithms of support vector machines (SVMs), which currently constitute the state of the art for numerous pattern recognition problems.
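The difference between the two principles can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the hypothesis classes are nested finite grids of threshold classifiers on [0, 1], and the guaranteed risk is the elementary bound obtained from Hoeffding's inequality and a union bound over one finite class, not the VC-dimension bounds of [48]. A rigorous SRM bound would also have to account for the choice among classes (for instance by splitting the confidence level across them). All names (`empirical_risk`, `erm`, `guaranteed_risk`, `srm`) and the sample are hypothetical.

```python
import math

def empirical_risk(threshold, sample):
    """Training error of the classifier x -> 1 if x >= threshold else 0."""
    errors = sum(1 for x, y in sample if (1 if x >= threshold else 0) != y)
    return errors / len(sample)

def erm(thresholds, sample):
    """ERM over a finite class: return (threshold, training error) of a minimizer."""
    best = min(thresholds, key=lambda t: empirical_risk(t, sample))
    return best, empirical_risk(best, sample)

def guaranteed_risk(emp_risk, class_size, n, delta=0.05):
    """Training error plus a confidence term that grows with the class size
    (Hoeffding's inequality and a union bound over a finite class)."""
    return emp_risk + math.sqrt((math.log(class_size) + math.log(1 / delta)) / (2 * n))

def srm(nested_classes, sample, delta=0.05):
    """SRM sketch: run ERM in each class, keep the candidate with the smallest bound."""
    n = len(sample)
    candidates = []
    for thresholds in nested_classes:
        t, risk = erm(thresholds, sample)
        bound = guaranteed_risk(risk, len(thresholds), n, delta)
        candidates.append((bound, t, len(thresholds)))
    return min(candidates)

# Nested dyadic grids of thresholds on [0, 1], with 3, 5, 9 and 17 elements.
nested = [[i / 2**k for i in range(2**k + 1)] for k in range(1, 5)]

# Toy labeled sample (assumption): the labels switch from 0 to 1 between 0.2 and 0.3.
sample = [(0.05, 0), (0.1, 0), (0.2, 0), (0.3, 1),
          (0.4, 1), (0.6, 1), (0.8, 1), (0.9, 1)]

bound, threshold, size = srm(nested, sample)
print(f"selected class size: {size}, threshold: {threshold}, bound: {bound:.3f}")
```

On this sample, the smallest class cannot reach zero training error, while every larger class can. The procedure therefore retains the five-threshold class: it is the smallest class that fits the data, and the confidence term penalizes the richer ones.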

SVMs are connectionist models designed to compute indicator functions, perform regression or estimate densities. They were introduced during the last decade by Vapnik and co-workers [32] as nonlinear extensions of the maximal margin hyperplane [47]. Their main advantage is that they can avoid overfitting when the sample is small [48][29].
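To make the notion of margin concrete, here is a minimal sketch of a linear, soft-margin classifier trained by batch subgradient descent on the regularized hinge loss. It is a simplification chosen for brevity, not the training algorithm of [32]: there is no kernel, the hyperplane is constrained to pass through the origin, and the quadratic program is replaced by a plain gradient scheme. The function names, hyperparameters (`lam`, `lr`, `epochs`) and data are assumptions made for the illustration.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimize (lam / 2) * ||w||^2 + mean hinge loss by batch subgradient descent.

    Labels are in {-1, +1}; the hyperplane passes through the origin (no bias).
    """
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    for _ in range(epochs):
        # Gradient of the regularization term.
        grad = [lam * wj for wj in w]
        # Subgradient of the hinge loss: only points with margin < 1 contribute.
        for x, y in zip(points, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1:
                for j in range(dim):
                    grad[j] -= y * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Side of the hyperplane on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy linearly separable sample (assumption).
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
          (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w = train_linear_svm(points, labels)
predictions = [predict(w, x) for x in points]
```

The regularization weight `lam` controls the trade-off between a large margin and a small training error, which is the mechanism by which the capacity of the model is limited when few examples are available.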