## Section: Scientific Foundations

Keywords: non-parametric, boundary estimation, kernel methods, dimension reduction, extreme value analysis.

### Functional inference, semi- and non-parametric methods

Participants: Laurent Gardes, Stéphane Girard.

We also consider methods which do not assume a parametric model. These
approaches are non-parametric in the sense that they do not require a
prior model on the unknown quantities. This property is important
since, for image applications for instance, it is very difficult to
introduce sufficiently general parametric models given the wide variety
of image contents.
As an illustration, the grey-level surface of an image can usually not
be described by a simple mathematical equation. Projection methods are
then a way to decompose the unknown signal or image on a set of basis
functions (*e.g.* wavelets). Kernel methods, which rely on smoothing
the data with a set of kernels (usually probability densities), are
other examples. Relationships exist between these methods and learning
techniques such as Support Vector Machines (SVM), as appears in the
context of *boundary estimation* and *image segmentation*.
Such non-parametric methods have become the cornerstone for dealing
with functional data [47], for instance when the observations are
curves: they make it possible to model the data without a
discretization step.
More generally, these techniques are of great use for dimension
reduction purposes, since they reduce the dimension of functional or
multivariate data without assumptions on the distribution of the
observations.
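As a minimal illustration of the kernel idea (a sketch with an arbitrary sample, grid and bandwidth, not an implementation used by the team), a density can be estimated by placing a Gaussian kernel at each observation:

```python
import numpy as np

def gaussian_kde(x_eval, data, bandwidth):
    """Kernel density estimate at the points x_eval from a 1-D sample."""
    # Each observation contributes a Gaussian bump of width `bandwidth`;
    # the estimate is the average of these bumps.
    u = (x_eval[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
data = rng.normal(size=500)            # sample from a standard Gaussian
grid = np.linspace(-3.0, 3.0, 61)
density = gaussian_kde(grid, data, bandwidth=0.4)
# Sanity check: the estimate should integrate to roughly 1 on the grid.
print(density.sum() * (grid[1] - grid[0]))
```

No prior model is assumed here: only the bandwidth, which governs the amount of smoothing, has to be chosen.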
Semi-parametric methods combine parametric and non-parametric aspects.
This is the case in *extreme value analysis* [46], which is based on
the modelling of distribution tails. It differs from traditional
statistics, which focuses on the central part of distributions, *i.e.*
on the most probable events. Extreme value theory shows that
distribution tails can be modelled by both a functional part and a real
parameter, the extreme value index.
As another example, relationships exist between multiresolution
analysis and parametric Markov tree models.

#### Modelling extremal events

Extreme value theory is a branch of statistics dealing with the extreme
deviations from the bulk of probability distributions.
More specifically, it focuses on the limiting distributions for the
minimum or the maximum of a large collection of random observations
from the same arbitrary distribution.
Let $x_{1}\leq\dots\leq x_{n}$ denote $n$ ordered observations of a
random variable $X$ representing some quantity of interest. A
$p_{n}$-quantile of $X$ is the value $q_{p_{n}}$ such that the
probability that $X$ exceeds $q_{p_{n}}$ is $p_{n}$, *i.e.*
$P(X>q_{p_{n}})=p_{n}$. When $p_{n}<1/n$, such a quantile is said to be
extreme since it usually lies beyond the maximum observation $x_{n}$
(see Figure 1).
Estimating such quantiles therefore requires specific
methods [44], [43] to extrapolate information beyond the observed
values of $X$; these methods are based on extreme value theory.
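As a hedged sketch of such an extrapolation (assuming a heavy-tailed sample; the distribution, the sample size and the number $k$ of upper order statistics are arbitrary choices for this illustration, not a method advocated in the references above), the classical Hill estimator of the extreme value index can be plugged into the Weissman quantile estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 10_000, 200, 1e-5        # p < 1/n: beyond the sample maximum
# Standard Pareto sample with survival function S(x) = x^{-2}, x >= 1,
# i.e. extreme value index gamma = 1/2.
x = np.sort(rng.pareto(2.0, size=n) + 1.0)

# Hill estimator of gamma from the k largest order statistics.
gamma = np.mean(np.log(x[-k:])) - np.log(x[-k - 1])

# Weissman estimator: extrapolate from the (n - k)-th order statistic.
q_hat = x[-k - 1] * (k / (n * p)) ** gamma
q_true = p ** -0.5                 # true p-quantile of this Pareto law
print(q_hat, q_true)
```

Even though the target quantile lies far beyond the largest observation, the estimate remains of the right order of magnitude, which is precisely what extrapolation in the tail buys.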
This kind of issue first appeared in hydrology, where one objective was
to assess the risk of highly unusual events, such as 100-year floods,
from flows measured over 50 years. More generally, the problems that we
address are part of risk management theory. For instance, in
reliability, the distributions of interest belong to a semi-parametric
family whose tails decrease exponentially fast [50].
These so-called Weibull-tail distributions [52], [49], [8], [21] are defined by their
survival distribution function:

$$\bar{F}(x) = \exp\left(-x^{1/\theta}\,\ell(x)\right), \quad x > x_{0} > 0,$$

where both $\theta>0$ and the slowly varying function $\ell(x)$ are unknown. Gaussian, gamma, exponential and Weibull distributions, among others, belong to this family. The function $\ell(x)$ acts as a nuisance parameter which induces a bias in the classical extreme-value estimators developed so far.
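As a minimal numerical check of this family (an illustrative sketch, not one of the estimators from the references above): for exponential data, $-\log \bar{F}(x) = x$, so a regression of $\log(-\log \bar{F}(x))$ on $\log x$ in the tail should have a slope close to $1/\theta = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
# Exponential sample: survival function S(x) = exp(-x), hence theta = 1
# in the Weibull-tail model S(x) = exp(-x^{1/theta} l(x)).
x = np.sort(rng.exponential(size=5000))
n = len(x)
s = 1.0 - np.arange(1, n + 1) / (n + 1.0)   # empirical survival at x_i
# Regress log(-log S) on log x over the upper 10% of the sample.
tail = slice(int(0.9 * n), n - 1)
slope, _ = np.polyfit(np.log(x[tail]), np.log(-np.log(s[tail])), 1)
theta_hat = 1.0 / slope
print(theta_hat)                            # should be close to 1
```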

#### Boundary estimation

Boundary estimation, or more generally level set estimation, is a recurrent problem in statistics which is linked to outlier detection. In biology, one is interested in estimating reference curves, that is to say curves which bound 90% (for example) of the population. Points outside this bound are considered as outliers with respect to the reference population. In image analysis, the boundary estimation problem arises in image segmentation as well as in supervised learning.
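As a hedged sketch of the link with outlier detection (arbitrary Gaussian sample and bandwidth, not the team's estimators), one can flag as outliers the points falling outside the level set that contains 90% of a kernel density estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 2))          # 2-D reference population

def density_scores(points, sample, bandwidth=0.5):
    # Gaussian-kernel density score of each point w.r.t. the sample
    # (unnormalized: only the ordering of the scores matters here).
    d2 = ((points[:, None, :] - sample[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / bandwidth**2).mean(axis=1)

scores = density_scores(data, data)
level = np.quantile(scores, 0.10)          # 90% of the points lie above
outliers = scores < level
print(outliers.mean())                     # about 0.10 by construction
```

The estimated boundary is the density contour at `level`; points below it are exactly the low-density observations a reference curve would exclude.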

#### Dimension reduction

Our work on high-dimensional data includes non-parametric aspects. They are related to Principal Component Analysis (PCA), which is traditionally used to reduce the dimension of the data. However, standard linear PCA can be quite inefficient on image data, where even simple image distortions can lead to highly non-linear structures. When dealing with classification problems, our main project is then to adapt the non-linear PCA method proposed in [51], [10]. This method (first introduced in Stéphane Girard's PhD thesis) relies on the approximation of datasets by manifolds, generalizing the linear subspaces of PCA. This approach yields good performance when the data are images [4].
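A minimal sketch of why linear PCA struggles on non-linear data (purely illustrative, unrelated to the actual method of [51], [10]): on points drawn from a circle, an intrinsically one-dimensional manifold, both principal directions carry about half of the variance, so no one-dimensional linear subspace summarizes the data:

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.uniform(0.0, 2.0 * np.pi, size=1000)
circle = np.column_stack([np.cos(t), np.sin(t)])  # 1-D manifold in R^2

# Linear PCA: eigen-decomposition of the sample covariance matrix.
cov = np.cov(circle, rowvar=False)
ratios = np.sort(np.linalg.eigvalsh(cov))[::-1]
ratios = ratios / ratios.sum()
print(ratios)   # both close to 0.5: no good one-dimensional projection
```

A manifold-based method, by contrast, can represent these points with the single parameter $t$.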

Our work also includes parametric approaches, in particular for classification and learning issues. In high-dimensional spaces, learning methods suffer from the curse of dimensionality: even for large datasets, large parts of the space are left empty. One of our approaches is therefore to develop new Gaussian models of high-dimensional data for parametric inference. Such models can then be used in a mixture or Markov framework for classification purposes.
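The emptiness of high-dimensional spaces can be seen in a short Monte-Carlo experiment (a sketch with arbitrary dimensions and sample size): the fraction of the cube $[-1,1]^d$ lying inside the unit ball collapses as $d$ grows, so uniformly drawn points almost never fall near the centre:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
fractions = {}
for d in (2, 10, 50):
    # Monte-Carlo estimate of the volume fraction of [-1, 1]^d
    # occupied by the unit ball.
    pts = rng.uniform(-1.0, 1.0, size=(n, d))
    fractions[d] = float(((pts**2).sum(axis=1) <= 1.0).mean())
print(fractions)   # roughly pi/4 for d = 2, nearly 0 for d = 50
```

This is the regime in which constrained Gaussian models become attractive: they keep the number of parameters manageable where unconstrained covariance estimation is hopeless.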