The MISTIS team aims at developing statistical methods for dealing with complex problems or data. Our applications consist mainly of image processing and spatial data problems, with some applications in biology and medicine. Our approach is based on the statement that complexity can be handled by working up from simple local assumptions in a coherent way, defining a structured model, and that this is the key to modelling, computation, inference and interpretation. The methods we focus on involve mixture models, Markov models, and more generally hidden structure models identified by stochastic algorithms on the one hand, and semiparametric and nonparametric methods on the other hand.
Hidden structure models are useful for taking into account heterogeneity in data. They concern many areas of statistical methodology (finite mixture analysis, hidden Markov models, random effect models, ...). Due to their missing data structure, they induce specific difficulties for both estimating the model parameters and assessing performance. The team focuses on research regarding both aspects. We design specific algorithms for estimating the parameters of missing structure models and we propose and study specific criteria for choosing the most relevant missing structure models in several contexts.
Semiparametric and nonparametric methods are relevant and useful when no appropriate parametric model exists for the data under study, either because of data complexity or because information is missing. The focus is on functions describing curves, surfaces or, more generally, manifolds rather than on real-valued parameters. This can be interesting in image processing, for instance, where it can be difficult to introduce parametric models that are general enough (e.g. for contours).
In a first approach, we consider statistical parametric models, $\theta$ being the parameter, possibly multidimensional, usually unknown and to be estimated. We consider cases where the data naturally divide into observed data $y = y_1, \dots, y_n$ and unobserved or missing data $z = z_1, \dots, z_n$. The missing data $z_i$ represents for instance the membership to one of a set of $K$ alternative categories. The distribution of an observed $y_i$ can be written as a finite mixture of distributions,

$$f(y_i \mid \theta) = \sum_{k=1}^{K} P(z_i = k \mid \theta)\, f(y_i \mid z_i = k, \theta).$$
These models are interesting in that they may point out a hidden variable responsible for most of the observed variability and such that the observed variables are conditionally independent. Their estimation is often difficult due to the missing data. The Expectation-Maximization (EM) algorithm is a general and now standard approach to maximization of the likelihood in missing data problems. It provides parameter estimates but also values for the missing data.
Mixture models correspond to independent $z_i$'s. They are increasingly used in statistical pattern recognition. They allow a formal (model-based) approach to (unsupervised) clustering.
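As an illustration of how EM handles the missing labels in a mixture model, the following is a minimal sketch for a univariate Gaussian mixture. It is not the team's software: the function names, the choice of Gaussian components and the variance floor are assumptions made for this example only.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(y, means, variances, weights, n_iter=100):
    """EM for a K-component univariate Gaussian mixture (illustrative sketch).

    Returns (weights, means, variances, post), where post[i][k] is the
    membership probability P(z_i = k | y_i) computed in the last E-step.
    """
    K = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of each hidden label z_i.
        post = []
        for yi in y:
            joint = [weights[k] * normal_pdf(yi, means[k], variances[k]) for k in range(K)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M-step: weighted maximum-likelihood updates of the parameters.
        for k in range(K):
            nk = sum(p[k] for p in post)
            weights[k] = nk / len(y)
            means[k] = sum(p[k] * yi for p, yi in zip(post, y)) / nk
            var = sum(p[k] * (yi - means[k]) ** 2 for p, yi in zip(post, y)) / nk
            variances[k] = max(var, 1e-6)  # assumed floor to avoid degenerate components
    return weights, means, variances, post


if __name__ == "__main__":
    data = [-0.2, 0.0, 0.2, 4.8, 5.0, 5.2]
    w, m, v, post = em_gaussian_mixture(data, [0.5, 4.0], [1.0, 1.0], [0.5, 0.5])
    print("weights:", w)
    print("means:", m)
```

The membership probabilities returned in `post` are the "values for missing data" mentioned above: taking the most probable component for each observation yields a model-based clustering.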
Graphical modelling provides a diagrammatic representation of the logical structure of a joint probability distribution, in the form of a network or graph depicting the local relations among variables. The graph can have directed or undirected links or edges between the nodes, which represent the individual variables. Associated with the graph are various Markov properties that specify how the graph encodes conditional independence assumptions.
It is the conditional independence assumptions that give the graphical models their fundamental modular structure, enabling computation of globally interesting quantities from local specifications. In this way graphical models form an essential basis for our methodologies based on structures.
The graphs can be either directed, e.g. Bayesian Networks, or undirected, e.g. Markov Random Fields. The specificity of Markovian models is that the dependencies between the nodes are limited to the nearest neighbor nodes. The neighborhood definition can vary and be adapted to the problem of interest. When part of the variables (nodes) are not observed or missing, we refer to these models as Hidden Markov Models (HMM). Hidden Markov chains or hidden Markov fields correspond to cases where the $z_i$'s of the mixture model above are distributed according to a Markov chain or a Markov field. They are a natural extension of mixture models. They are widely used in signal processing (speech recognition, genome sequence analysis) and in image processing (remote sensing, MRI, etc.). Such models are very flexible in practice and can naturally account for the phenomena to be studied.
They are very useful in modelling spatial dependencies, but these dependencies and the possible existence of hidden variables are also responsible for a typically large amount of computation. It follows that the statistical analysis may not be straightforward. Typical issues are related to the neighborhood structure to be chosen when not dictated by the context and the possible high dimensionality of the observations. This also requires a good understanding of the role of each parameter and methods to tune them depending on the goal in mind. As regards estimation algorithms, they correspond to an energy minimization problem which is NP-hard and usually performed through approximation. We focus on a certain type of methods based on the mean field principle and propose effective algorithms which show good performance in practice and for which we also study theoretical properties. We also propose some tools for model selection. Finally, we investigate ways to extend the standard Hidden Markov Field model to increase its modelling power.
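To give a concrete idea of the mean field principle, the sketch below runs a textbook mean-field fixed-point iteration for a hidden Potts model with known parameters: each site's label distribution is updated from its own likelihood and from the current label probabilities of its neighbors. This is only an illustration under simplifying assumptions (fixed interaction parameter `beta`, precomputed log-likelihoods, hypothetical function names); the algorithms developed in the team also estimate the model parameters and are not reproduced here.

```python
import math


def grid_neighbors(height, width):
    """4-nearest-neighbor lists for a height x width grid, sites in row-major order."""
    neighbors = []
    for r in range(height):
        for c in range(width):
            nb = []
            if r > 0:
                nb.append((r - 1) * width + c)
            if r < height - 1:
                nb.append((r + 1) * width + c)
            if c > 0:
                nb.append(r * width + c - 1)
            if c < width - 1:
                nb.append(r * width + c + 1)
            neighbors.append(nb)
    return neighbors


def softmax(scores):
    """Normalize log-scores into probabilities in a numerically stable way."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_field_potts(log_lik, neighbors, beta, n_sweeps=20):
    """Mean-field approximation of the label posteriors of a hidden Potts model.

    log_lik[i][k] is log f(y_i | z_i = k); neighbors[i] lists the sites adjacent
    to site i; beta is the (assumed known) Potts interaction strength.
    """
    K = len(log_lik[0])
    # Start from the posterior of the independent mixture model (beta = 0).
    q = [softmax(row) for row in log_lik]
    for _ in range(n_sweeps):
        for i, nb in enumerate(neighbors):
            # Neighbors are replaced by their current expected label indicators.
            scores = [log_lik[i][k] + beta * sum(q[j][k] for j in nb) for k in range(K)]
            q[i] = softmax(scores)
    return q
```

With `beta = 0` the procedure reduces to the independent mixture case; larger values of `beta` favor spatially homogeneous labelings.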
We also consider methods which do not assume a parametric model. The approaches are nonparametric in the sense that they do not require the assumption of a prior model on the unknown quantities. This property is important since, for image applications for instance, it is very difficult to introduce sufficiently general parametric models because of the wide variety of image contents. Projection methods are then a way to decompose the unknown quantity on a set of functions (e.g. wavelets). Kernel methods, which rely on smoothing the data using a set of kernels (usually probability distributions), are other examples. Relationships exist between these methods and learning techniques using Support Vector Machines (SVM), as appears in the context of level-set estimation. Such nonparametric methods have become the cornerstone when dealing with functional data. This is the case for instance when observations are curves. They make it possible to model the data without a discretization step. More generally, these techniques are of great use for dimension reduction purposes. They make it possible to reduce the dimension of functional or multivariate data without assumptions on the distribution of the observations. Semiparametric methods refer to methods that include both parametric and nonparametric aspects. Examples include the Sliced Inverse Regression (SIR) method, which combines nonparametric regression techniques with parametric dimension reduction aspects. This is also the case in extreme value analysis, which is based on the modelling of distribution tails. It differs from traditional statistics, which focus on the central part of distributions, i.e. on the most probable events. Extreme value theory shows that distribution tails can be modelled by both a functional part and a real parameter, the extreme value index.
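As a small illustration of the kernel methods mentioned above, the following sketch implements a Nadaraya-Watson regression smoother with a Gaussian kernel. It is a standard textbook estimator chosen for the example; the function names and the `bandwidth` parameter are illustrative and do not come from the team's software.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Kernel regression estimate at x0: a locally weighted average of the ys.

    No parametric form is assumed for the regression function; the only
    tuning parameter is the bandwidth, which controls the amount of smoothing.
    """
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


if __name__ == "__main__":
    xs = [i / 10.0 for i in range(21)]
    ys = [math.sin(x) for x in xs]
    print(nadaraya_watson(1.0, xs, ys, bandwidth=0.2))
```

A small bandwidth follows the data closely, while a large one produces a smoother but more biased estimate.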
Extreme value theory is a branch of statistics dealing with the extreme deviations from the bulk of probability distributions. More specifically, it focuses on the limiting distributions for the minimum or the maximum of a large collection of random observations from the same arbitrary distribution. Let $X_{1,n} \le \dots \le X_{n,n}$ denote $n$ ordered observations from a random variable $X$ representing some quantity of interest. A $p_n$-quantile of $X$ is the value $x_{p_n}$ such that the probability that $X$ is greater than $x_{p_n}$ is $p_n$, i.e. $P(X > x_{p_n}) = p_n$. When $p_n < 1/n$, such a quantile is said to be extreme since it is usually greater than the maximum observation $X_{n,n}$ (see the associated figure).
Estimating such quantiles therefore requires dedicated methods to extrapolate information beyond the observed values of $X$. Those methods are based on extreme value theory. This kind of issue appeared in hydrology. One objective was to assess risk for highly unusual events, such as 100-year floods, starting from flows measured over 50 years. To this end, semiparametric models of the tail are considered:

$$P(X > x) = x^{-1/\gamma}\, \ell(x), \quad x > x_0 > 0,$$

where both the extreme-value index $\gamma > 0$ and the function $\ell(x)$ are unknown. The function $\ell(x)$ acts as a nuisance parameter which yields a bias in the classical extreme-value estimators developed so far. Such models are often referred to as heavy-tail models since the probability of extreme events decreases at a polynomial rate to zero. More generally, the problems that we address are part of risk management theory. For instance, in reliability, the distributions of interest are included in a semiparametric family whose tails decrease exponentially fast. These so-called Weibull-tail distributions are defined by their survival distribution function:

$$P(X > x) = \exp\{-x^{\theta}\, \ell(x)\}, \quad x > x_0 > 0, \quad \theta > 0.$$

Gaussian, gamma, exponential and Weibull distributions, among others, are included in this family. An important part of our work consists in establishing links between the heavy-tail and Weibull-tail models above in order to propose new estimation methods. We also consider the case where the observations were recorded with covariate information. In this case, the extreme-value index and the $p_n$-quantile are functions of the covariate. We propose estimators of these functions by using a moving window approach.
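To make the extrapolation idea concrete, here is a minimal sketch of two classical estimators for heavy-tail models: the Hill estimator of the extreme-value index and the Weissman extrapolation of an extreme quantile. These are standard estimators from the literature, shown only as an illustration; they are not claimed to be the estimators developed by the team, and the function names and the choice of `k` are assumptions of this example.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimator of the extreme-value index from the k largest observations.

    Assumes positive observations and 0 < k < n.
    """
    x = sorted(sample)
    n = len(x)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    threshold = x[n - k - 1]  # order statistic X_{n-k,n}
    return sum(math.log(x[n - i]) - math.log(threshold) for i in range(1, k + 1)) / k


def weissman_quantile(sample, k, p):
    """Weissman estimate of the quantile x_p such that P(X > x_p) = p.

    The threshold X_{n-k,n} is extrapolated using the polynomial tail decay,
    so p may be smaller than 1/n (an extreme quantile).
    """
    x = sorted(sample)
    n = len(x)
    gamma_hat = hill_estimator(sample, k)
    return x[n - k - 1] * (k / (n * p)) ** gamma_hat


if __name__ == "__main__":
    random.seed(0)
    # Pareto sample with shape 2, i.e. extreme-value index 1/2.
    data = [random.paretovariate(2.0) for _ in range(2000)]
    print("estimated index:", hill_estimator(data, 200))
    print("estimated quantile for p = 1e-4:", weissman_quantile(data, 200, 1e-4))
```

The number `k` of upper order statistics trades variance (small `k`) against the bias induced by the nuisance function (large `k`), which is why bias-corrected estimators are of interest.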
Level set estimation is a recurrent problem in statistics which is linked to outlier detection. In biology, one is interested in estimating reference curves, that is to say curves which bound 90% (for example) of the population. Points outside this bound are considered as outliers compared to the reference population. Level set estimation can be viewed as a conditional quantile estimation problem, which makes it possible to benefit from a nonparametric statistical framework. In particular, boundary estimation, arising in image segmentation as well as in supervised learning, is interpreted as an extreme level set estimation problem. Level set estimation can also be formulated as a linear programming problem. In this context, estimates are sparse since they involve only a small fraction of the dataset, called the set of support vectors.
Our work on high dimensional data requires us to face the curse of dimensionality. Indeed, the modelling of high dimensional data requires complex models and thus the estimation of a large number of parameters compared to the sample size. In this framework, dimension reduction methods aim at replacing the original variables by a small number of linear combinations with as little loss of information as possible. Principal Component Analysis (PCA) is the most widely used method to reduce dimension in data. However, standard linear PCA can be quite inefficient on image data where even simple image distortions can lead to highly nonlinear data. Two directions are investigated. First, nonlinear PCAs can be proposed, leading to semiparametric dimension reduction methods. Another field of investigation is to take into account the application goal in the dimension reduction step. One of our approaches is therefore to develop new Gaussian models of high dimensional data for parametric inference. Such models can then be used in a mixture or Markov framework for classification purposes. Another approach consists in combining dimension reduction, regularization techniques and regression techniques to improve the Sliced Inverse Regression method.
As regards applications, several areas of image analysis can be covered using the tools developed in the team. More specifically, we address, in collaboration with team Lear, issues about object and class recognition and about the extraction of visual information from large image databases. In collaboration with team Perception, we also address various issues in computer vision involving Bayesian modelling and probabilistic clustering techniques. Other applications in medical imaging are natural. We work more specifically on MRI data. We also consider other statistical 2D fields coming from other domains such as remote sensing. Also, in the context of the ANR MDCO project, we work on hyperspectral multi-angle images.
A second domain of applications concerns biomedical statistics and molecular biology. We consider the use of missing data models in population genetics. We also investigate statistical tools for the analysis of bacterial genomes beyond gene detection. Applications in agronomy and epidemiology are also considered. Finally, in the context of the ANR VMC project, we plan to study the uncertainties on the forecasting and climate projection for Mediterranean high-impact weather events.
Reliability and industrial lifetime analysis are applications developed through collaborations with the EDF research department and the LCFR laboratory (Laboratoire de Conduite et Fiabilité des Réacteurs) of CEA / Cadarache. We also consider failure detection in print infrastructure through collaborations with Xerox, Meylan and the CIFRE PhD thesis of Laurent Donini, co-advised by Jean-Baptiste Durand and Stéphane Girard.
Joint work with: Charles Bouveyron (Université Paris 1) and Gilles Celeux (Select, INRIA). The High-Dimensional Discriminant Analysis (HDDA) and the High-Dimensional Data Clustering (HDDC) toolboxes contain, respectively, efficient supervised and unsupervised classifiers for high-dimensional data. These classifiers are based on Gaussian models adapted to high-dimensional data. The HDDA and HDDC toolboxes are available for Matlab and are included in the MixMod software.
Joint work with: Diebolt, J. (CNRS) and Garrido, M. (INRA Clermont-Ferrand).
The Extremes software is a toolbox dedicated to the modelling of extremal events, offering extreme quantile estimation procedures and model selection methods. This software results from a collaboration with EDF R&D. It is also a consequence of the PhD thesis work of Myriam Garrido. The software is written in C++ with a Matlab graphical interface. It is now available for both Windows and Linux environments. It can be downloaded at the following URL:
http://
The SpaCEM³ (Spatial Clustering with EM and Markov Models) program replaces the former, still available, SEMMS (Spatial EM for Markovian Segmentation) program developed with Nathalie Peyrard from INRA Avignon.
SpaCEM³ provides a variety of algorithms for image segmentation and for supervised and unsupervised classification of multidimensional and spatially located data. The main techniques use the EM algorithm for soft clustering and Markov Random Fields for spatial modelling. The learning and inference parts rely on recent developments based on mean field approximations. The main functionalities of the program include:
The former SEMMS functionalities, i.e.:
Model-based unsupervised image segmentation, including the following models: Hidden Markov Random Field and mixture model;
Model selection for the Hidden Markov Random Field model;
Simulation of commonly used Hidden Markov Random Field models (Potts models);
Simulation of an independent Gaussian noise for the simulation of noisy images.
And additional possibilities such as:
New Markov models including various extensions of the Potts model and triplet Markov models;
Additional treatment of very high dimensional data using dimension reduction techniques within a classification framework;
Models and methods allowing supervised classification with new learning and test steps.
The SEMMS package, written in C, is publicly available at:
http://
Joint work with: Francois, O. (TimB, TIMC) and Chen, C. (former postdoctoral fellow in Mistis).
The FASTRUCT program is dedicated to the modelling and inference of population structure from genetic data. Bayesian model-based clustering programs have gained increased popularity in studies of population structure since the publication of the software STRUCTURE. These programs are generally acknowledged as performing well, but their running time may be prohibitive. FASTRUCT is a non-Bayesian implementation of the classical model with no admixture and uncorrelated allele frequencies. This new program relies on the Expectation-Maximization principle, and produces assignments rivaling those of other model-based clustering programs. In addition, it can be several-fold faster than Bayesian implementations. The software consists of a command-line engine, which is suitable for batch analysis of data, and an MS Windows graphical interface, which is convenient for exploring data.
It is written for Windows OS and contains a detailed user's guide. It is available at
http://
The functionalities are further described in the related publication:
Molecular Ecology Notes 2006 .
Joint work with: Francois, O. (TimB, TIMC) and Chen, C. (former postdoctoral fellow in Mistis).
TESS is a computer program that implements a Bayesian clustering algorithm for spatial population genetics. It is particularly useful for seeking genetic barriers or genetic discontinuities in continuous populations. The method is based on a hierarchical mixture model where the prior distribution on cluster labels is defined as a Hidden Markov Random Field. Given individual geographical locations, the program seeks population structure from multilocus genotypes without assuming predefined populations. TESS takes input data files in a format compatible with existing non-spatial Bayesian algorithms (e.g. STRUCTURE). It returns graphical displays of cluster membership probabilities and geographical cluster assignments from its Graphical User Interface.
The functionalities and the comparison with three other Bayesian Clustering programs are specified in the following publication:
Molecular Ecology Notes 2007
Joint work with: Bouveyron, C. (Université Paris 1) and Celeux, G. (Select, INRIA).
In the PhD work of Charles Bouveyron (co-advised by Cordelia Schmid from the INRIA team LEAR), we propose new Gaussian models of high dimensional data for classification purposes. We assume that the data live in several groups located in subspaces of lower dimensions. Two different strategies arise:
the introduction in the model of a dimension reduction constraint for each group,
the use of parsimonious models obtained by requiring different groups to share the same values of some parameters.
This modelling yields a new supervised classification method called HDDA, for High Dimensional Discriminant Analysis. Some versions of this method have been tested on the supervised classification of objects in images. This approach has been adapted to the unsupervised classification framework, and the related method is named HDDC, for High Dimensional Data Clustering. In collaboration with Gilles Celeux and Charles Bouveyron, we are currently working on the automatic selection of the discrete parameters of the model. Another part of the work of Charles Bouveyron and Stéphane Girard consists in extending this approach to the semi-supervised context or to the presence of label noise.
Joint work with: Arnaud, E., Hansard, M., Horaud, R. and Narasimha, R. from the INRIA team Perception.
Geometric and probabilistic fusion of spatial visual and auditory cues. We first explain how we can combine spatial visual and auditory cues in a geometric and probabilistic framework. This is done in order to address the issues of detecting and localizing objects in a scene that are both seen and heard. To do so, we used binaural and binocular sensors for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audiovisual observations into coherent groups. The proposed probabilistic generative model captures the relations between audio and visual observations. This model maps the data into a common audiovisual 3D representation via a pair of mixture models. The statistical method of choice for solving this problem is cluster analysis. We rely on low-level audio and video features, which makes our model more general and less dependent on supervised learning techniques, such as face and speech detectors. The input data consists of $M$ visual observations $f = \{f_1, \dots, f_m, \dots, f_M\}$ and $K$ auditory observations $g = \{g_1, \dots, g_k, \dots, g_K\}$. This data is recorded over a time interval $[t_1, t_2]$, which is short enough to ensure that the audiovisual (AV) objects responsible for $f$ and $g$ are effectively stationary in space. Then we address the estimation of the AV object sites $S = \{s_1, \dots, s_n, \dots, s_N\}$, where each $s_n$ is described by its 3D coordinates $(x_n, y_n, z_n)^T$. Note that in general $N$ is unknown. A visual observation $f_m$ is a 3D binocular coordinate $(u_m, v_m, d_m)^T$, where $u$ and $v$ denote the 2D location in the Cyclopean image. The scalar $d$ denotes the binocular disparity at $(u, v)^T$. Hence, Cyclopean coordinates $(u, v, d)^T$ are associated with each point $s = (x, y, z)^T$ in the visible scene. We define a function $F: \mathbb{R}^3 \to \mathbb{R}^3$ that maps $S$ onto $f$. An auditory observation $g_k$ is represented by an auditory disparity, namely the interaural time difference, or ITD. To relate a location to an ITD value we define a function $G: \mathbb{R}^3 \to \mathbb{R}$ that maps $S$ onto $g$. Given an observed ITD we can deduce the surface that should contain the source.
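The text above does not give explicit formulas for the two mappings, so the following sketch is only one plausible instantiation, under loudly stated assumptions: a rectified camera pair in which the Cyclopean coordinates are the perspective projection $(x/z, y/z)$ with a disparity proportional to inverse depth, and two microphones at hypothetical positions for which the ITD is the difference of travel times. The function names, the microphone positions and the speed of sound are all assumptions of this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def visual_map(s):
    """Hypothetical F: scene point (x, y, z) -> Cyclopean coordinates (u, v, d).

    Assumes a normalized, rectified camera pair, so that the disparity is
    proportional to inverse depth.
    """
    x, y, z = s
    if z <= 0:
        raise ValueError("the point must lie in front of the cameras (z > 0)")
    return (x / z, y / z, 1.0 / z)


def auditory_map(s, mic_left, mic_right, c=SPEED_OF_SOUND):
    """Hypothetical G: scene point -> interaural time difference in seconds.

    The ITD is modelled as the difference between the travel times from the
    source to the two microphones.
    """
    return (math.dist(s, mic_left) - math.dist(s, mic_right)) / c


if __name__ == "__main__":
    source = (1.0, 0.0, 2.0)
    mics = ((-0.1, 0.0, 0.0), (0.1, 0.0, 0.0))  # hypothetical microphone positions
    print(visual_map(source))
    print(auditory_map(source, *mics))
```

Under this particular model, all points sharing the same ITD lie on one sheet of a hyperboloid whose foci are the two microphones, which is one way to read the remark that an observed ITD determines a surface containing the source.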
We address the problem of AV localization in the framework of unsupervised clustering. The rationale is that observations form groups that correspond to the different AV objects in the scene. So the problem is recast as a clustering task: each observation must be assigned to one of the clusters and the cluster parameters, which include the $N$ 3D positions $s_n$ of the AV objects, must be estimated. To account for the presence of observations that are not related to any AV object, we introduce an additional background (outlier) class. Because of the different nature of the observations, clustering is performed via two mixture models, in the audio (1D) and video (3D) observation spaces respectively, subject to the common parametrization provided by the positions $s_n$. The next step is to devise a procedure that finds the best values for the assignments and for the parameters. One possibility is to use a version of the EM algorithm, as explained below.
Development of statistical methods for cross-modal integration. Given the probabilistic model defined above, we wish to determine the AV objects that generated the visual and auditory observations, that is, to derive values of the assignment vectors together with the AV object position vectors $S$ (which are part of the unknown parameters of our model). Direct maximum likelihood estimation of mixture models is usually difficult, due to the missing assignments. The Expectation-Maximization (EM) algorithm is a general and now standard approach to maximization of the likelihood in missing data problems. In our specific context, difficulties arise from the fact that it is necessary to perform simultaneous optimization in two different observation spaces, auditory and visual. It involves solving a system of nonlinear equations which does not yield a closed-form solution, so the traditional EM algorithm cannot be performed. As an alternative, we considered instances of the Generalized EM (GEM) algorithm, which is more flexible and provided good results in our experiments. This work has been published in the ICMI'08 conference, where more details as well as experiments can be found.
Joint work with: Scherrer, B. and Dojat, M. (Grenoble Institute of Neuroscience).
Clustering is a fundamental data analysis step that consists in producing a partitioning of the individuals to account for the groups existing in the observed data. In this work, we introduce an additional cooperative aspect and propose a framework for more general tasks. We address cases in which the goal is to produce not a single partitioning but two or more partitionings using cooperation between them. Cooperation is expressed by assuming the existence of two sets of missing assignment variables, representing two sets of labels which are not independent but related in the sense that information on one of them is useful to find the other one. We consider nontrivial situations in which Markov random field models are used to deal with additional interactions, including dependencies between labels within each label set. We show that our cooperative setting can be formulated in terms of conditional models and then propose to simplify inference into alternating and cooperative estimation procedures based on variants of the Expectation-Maximization (EM) algorithm. We illustrate the advantages of our approach by showing its ability to deal successfully with the complex task of segmenting simultaneously and cooperatively tissues and structures from MRI brain scans. In particular, this framework is used in the work described in the next section.
Joint work with: Scherrer, B., Dojat, M. (Grenoble Institute of Neuroscience) and Garbay, C. (LIG).
Difficulties in automatic MR brain scan segmentation arise from various sources. The nonuniformity of image intensity results in spatial intensity variations within each tissue, which is a major obstacle to an accurate automatic tissue segmentation. The automatic segmentation of subcortical structures is a challenging task as well. It cannot be performed based only on intensity distributions and requires the introduction of a priori knowledge. Most of the proposed approaches share two main characteristics. First, tissue and subcortical structure segmentations are considered as two successive tasks and treated relatively independently although they are clearly linked: a structure is composed of a specific tissue, and knowledge about the locations of structures provides valuable information about the local intensity distribution for a given tissue. Second, tissue models are estimated globally through the entire volume and then suffer from imperfections at a local level. Alternative local procedures exist but are either used as a preprocessing step or use redundant information to ensure consistency of local models. Recently, we reported good results using an innovative local and cooperative approach. It performs tissue and subcortical structure segmentation by distributing through the volume a set of local Markov Random Field (MRF) models which better reflect local intensity distributions. Local MRF models are used alternately for tissue and structure segmentations. Although satisfactory in practice, these tissue and structure MRFs do not correspond to a valid joint probabilistic model and are not compatible in that sense. As a consequence, important issues such as convergence or other theoretical properties of the resulting local procedure cannot be addressed. In addition, in that approach, cooperation mechanisms between local models are somewhat arbitrary and independent of the MRF models themselves.
Our contribution is then to propose a fully Bayesian framework in which we define a joint model that links local tissue and structure segmentations but also the model parameters, so that both types of cooperation, between tissues and structures and between local models, are deduced from the joint model and optimal in that sense. Our model has the following main features: 1) cooperative segmentation of both tissues and structures is encoded via a joint probabilistic model specified through conditional MRF models which capture the relations between tissues and structures. These model specifications also integrate external a priori knowledge in a natural way; 2) intensity nonuniformity is handled by using a specific parametrization of tissue intensity distributions which induces local estimations on subvolumes of the entire volume; 3) global consistency between local estimations is automatically ensured by using an MRF spatial prior for the parameters of the intensity distributions. Estimation within our framework is defined as a Maximum A Posteriori (MAP) estimation problem and is carried out by adopting an instance of the Expectation-Maximization (EM) algorithm. We show that such a setting adapts well to our conditional model formulation and simplifies into alternating and cooperative estimation procedures for standard Hidden MRF models. The approach is implemented using a multi-agent framework where each agent computes a local MRF model and cooperates with its neighboring agents for model refinement. The evaluation performed using a previously linearly registered atlas of 17 structures shows good results. An illustration is given in the accompanying figure.
Joint work with: Scherrer, B., Dojat, M. (Grenoble Institute of Neuroscience) and Garbay, C. (LIG).
The analysis of MR brain scans is a complex task that is further complicated if the observed data are themselves multidimensional, as is the case when several MR channels can provide complementary information and are considered simultaneously. Usually, data from healthy subjects do not raise the same issues as pathological data. The latter type of data rarely allows the use of automatic or generic approaches. Our goal is to extend our current framework to MRIs with Multiple Sclerosis lesions and stroke lesions. We address the issue of fusing the output of multiple MR sequences to robustly and accurately segment brain lesions. A key capability for radiologists is to delineate lesions from the rest of the brain tissues. To achieve this goal, radiologists usually make use of multiple MR sequences. The use of multiple sequences not only provides more measurements when segmenting the brain into regions, but crucially, different sequences may be complementary in that one may succeed when another fails. Achieving the same goal automatically and robustly is not an easy task. Overall system performance may be improved in two main ways, either by enhancing the processing of each individual sequence, or by improving the scheme for integrating the information from the different sequences. The contributions of this work concern the latter. We developed a model in which weights can be introduced to account for the relative importance of each modality, and we propose a variant of the EM algorithm in a Bayesian framework to estimate these weights iteratively and derive a segmentation of the lesions under consideration. Promising results are observed on patients with Multiple Sclerosis lesions (see the accompanying figure).
Joint work with: Guillou, A. (Univ. Strasbourg), and Diebolt, J. (CNRS, Univ. Marne-la-Vallée).
Our first achievement is the introduction of a new model of tail distributions depending on a function and on an unknown parameter. This model includes very different distribution tail behaviours from the three classical maximum domains of attraction. In the particular cases of Pareto-type tails or Weibull tails, our estimators coincide with classical ones proposed in the literature, thus making it possible to retrieve their asymptotic normality in a unified way. Our second achievement is the development of new estimators dedicated to Weibull-tail distributions: kernel estimators and bias correction through exponential regression.
Joint work with: Amblard, C. (TimB in TIMC laboratory, Univ. Grenoble 1).
The goal of the PhD thesis of Alexandre Lekina is to contribute to the development of theoretical and algorithmic models to tackle conditional extreme value analysis, i.e. the situation where some covariate information X is recorded simultaneously with a quantity of interest Y. In such a case, the tail heaviness of Y depends on X, and thus the tail index as well as the extreme quantiles are also functions of the covariate. We combine nonparametric smoothing techniques with extreme-value methods in order to obtain efficient estimators of the conditional tail index and conditional extreme quantiles. Conditional extremes are studied in climatology, where one is interested in how climate change over the years might affect extreme temperatures or rainfalls. In this case, the covariate is univariate (time). Bivariate examples include the study of extreme rainfalls as a function of the geographical location. The application part of the study will be joint work with the LTHE (Laboratoire d'étude des Transferts en Hydrologie et Environnement) located in Grenoble.
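One simple way to combine smoothing with an extreme-value estimator is to apply a tail-index estimator only to the responses whose covariate lies in a window around the point of interest. The sketch below does this with a Hill-type estimator and a uniform window; the function name, the window shape, and the parameters `bandwidth` and `k` are assumptions made for illustration, not the estimators studied in the thesis.

```python
import math


def conditional_hill(xs, ys, x0, bandwidth, k):
    """Hill-type estimate of the tail index of Y given X = x0.

    Only responses whose (univariate) covariate lies within
    `bandwidth` of x0 are used (uniform moving window, an assumption).
    """
    local = sorted(
        (y for x, y in zip(xs, ys) if abs(x - x0) <= bandwidth),
        reverse=True,
    )
    if not 1 <= k < len(local):
        raise ValueError("k must satisfy 1 <= k < number of points in the window")
    threshold = local[k]  # (k+1)-th largest response in the window
    if threshold <= 0:
        raise ValueError("the local threshold must be positive")
    return sum(math.log(y / threshold) for y in local[:k]) / k
```

Evaluating the function on a grid of values of x0 gives an estimate of the tail index as a function of the covariate.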
Future work will also include the study of multivariate extreme values. To this end, research on some particular copulas has been initiated with Cécile Amblard, since they are the key tool for building multivariate distributions.
Joint work with: Daouia, A. (Univ. Toulouse I), Jacob, P. and Menneteau, L. (Univ. Montpellier II).
The boundary bounding the set of points is viewed as the largest level set of the distribution of the points. This is then an extreme quantile curve estimation problem. We propose estimators based on projection as well as on kernel regression methods applied to the extreme values set, for particular sets of points. Our work is to define similar methods based on wavelet expansions in order to estimate non-smooth boundaries, and on local polynomial estimators to get rid of boundary effects. Besides, we are also working on the extension of our results to more general sets of points. To this end, we focus on the family of conditional heavy tails. An estimator of the conditional tail index has been proposed and the corresponding conditional extreme quantile estimator has been derived. This work was initiated in the PhD work of Laurent Gardes, co-directed by Pierre Jacob and Stéphane Girard, and with the consideration of star-shaped supports.
To overcome the curse of dimensionality arising in high-dimensional regression problems, one way consists in reducing the problem dimension. To this end, Sliced Inverse Regression (SIR) is an interesting solution. The original method, however, requires the inversion of the covariance matrix of the predictors. In case of collinearity between these predictors, or of small sample sizes compared to the dimension, the inversion is not possible and a regularization technique has to be used. We thus develop a new approach, based on a Fisher Lecture given by R.D. Cook, where it is shown that SIR axes can be interpreted as solutions of an inverse regression problem. In this work, a Gaussian prior distribution is introduced on the unknown parameters of the inverse regression problem in order to regularize their estimation. We show that some existing SIR regularizations can enter our framework, which permits a global understanding of these methods. Three new priors are proposed, leading to new regularizations of the SIR method.
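To make the inversion issue concrete, here is a minimal sketch of SIR in which the predictor covariance matrix is shifted by a ridge term before being inverted. The ridge shift is a common regularization chosen here for illustration and is not claimed to be one of the three priors proposed above; the function name and the parameters `n_slices`, `n_directions` and `tau` are hypothetical.

```python
import numpy as np


def ridge_sir(X, y, n_slices=10, n_directions=1, tau=0.0):
    """Sliced Inverse Regression with a ridge-regularized covariance.

    Returns a (p, n_directions) array of unit-norm estimated directions.
    With tau = 0 this reduces to classical SIR and requires an
    invertible covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    if n_slices > n:
        raise ValueError("n_slices must not exceed the sample size")
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / n  # predictor covariance matrix
    # Covariance of the slice means of X, slices being built on sorted y.
    gamma = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Xc[idx].mean(axis=0)
        gamma += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of (sigma + tau * I)^{-1} gamma.
    M = np.linalg.solve(sigma + tau * np.eye(p), gamma)
    eigvals, eigvecs = np.linalg.eig(M)
    top = np.argsort(-eigvals.real)[:n_directions]
    B = eigvecs[:, top].real
    return B / np.linalg.norm(B, axis=0)
```

When the number of predictors exceeds the sample size, `sigma` is singular and a strictly positive `tau` is needed for the linear solve to be well posed.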
This technique has been applied in particular in a collaboration with bioMerieux (see Section ). We co-advised the internship of Lamiae Azizi, who applied SIR in the context of quantitation procedures developed at bioMerieux.
Joint work with: Perot, N., Devictor, N. and Marquès, M. (CEA).
One of the main activities of the LCFR (Laboratoire de Conduite et Fiabilité des Réacteurs), CEA Cadarache, concerns the probabilistic analysis of some processes using reliability and statistical methods. In this context, probabilistic modelling of steel tenacity in nuclear plant tanks has been developed. The databases under consideration include hundreds of data points indexed by temperature, so that reliable probabilistic models have been obtained for the central part of the distribution. However, in this reliability problem, the key point is to investigate the behaviour of the model in the distribution tail. In particular, we are mainly interested in studying the lowest tenacities when the temperature varies (Figure ).
This work is supported by a research contract (from December 2008 to December 2010) involving mistis and the LCFR.
Joint work with: Molinié, G. from Laboratoire d'Etude des Transferts en Hydrologie et Environnement (LTHE), France.
Extreme rainfalls are generally associated with two different precipitation regimes. Extreme cumulated rainfall over 24 hours results from stratiform clouds on which the relief forcing is of primary importance. Extreme rainfall rates are defined as rainfall rates with low probability of occurrence, typically with higher mean return levels than the maximum observed level. For example, Figure presents the return levels for the Cévennes-Vivarais region. It is then of primary importance to study the sensitivity of the extreme rainfall estimation to the estimation method considered. A preliminary work on this topic is available. mistis got a Ministry grant for a related ANR project (see Section ).
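For readers unfamiliar with return levels, the sketch below computes the level exceeded on average once every `period` blocks (for instance, years) under a generalized extreme value (GEV) model for block maxima. The GEV model is an assumption made for illustration, since the text does not state which model or estimation method is used, and the function and parameter names are hypothetical.

```python
import math


def gev_return_level(mu, sigma, xi, period):
    """Return level of a GEV(mu, sigma, xi) law for a given return period.

    This is the quantile of order 1 - 1/period of the distribution of
    block maxima; xi = 0 corresponds to the Gumbel case.
    """
    if sigma <= 0 or period <= 1:
        raise ValueError("sigma must be positive and period greater than 1")
    y = -math.log(1.0 - 1.0 / period)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))
```

Because the result depends strongly on the shape parameter `xi`, different estimation methods for `xi` can lead to noticeably different return levels for long periods.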
Joint work with: Douté, S. from Laboratoire de Planétologie de Grenoble, France, in the context of the VAHINE project (see Section ).
Visible and near infrared imaging spectroscopy is one of the key techniques to detect, map and characterize mineral and volatile (e.g. water ice) species existing at the surface of the planets. Indeed, the chemical composition, granularity, texture, physical state, etc. of the materials determine the existence and morphology of the absorption bands. The resulting spectra therefore contain very useful information. Current imaging spectrometers provide data organized as three-dimensional hyperspectral images: two spatial dimensions and one spectral dimension.
Our goal is to estimate the functional relationship F between some observed spectra and some physical parameters. To this end, a database of synthetic spectra is generated by a physical radiative transfer model and used to estimate F. The high dimension of spectra is reduced by Gaussian regularized sliced inverse regression (GRSIR) to overcome the curse of dimensionality and consequently the sensitivity of the inversion to noise (ill-conditioned problems). This method is compared with the more classical SVM approach. GRSIR has the advantage of being very fast, interpretable and accurate. Recall that SVM approximates the functional F: y = F(x) using a solution of the form $F(x) = \sum_{i=1}^{n} \alpha_{i} K(x, x_{i}) + b$, where x_{i} are samples from the training set, K a kernel function and $\alpha_{i}$, $b$ are the parameters of F which are estimated during the training process. The kernel K is used to produce a nonlinear function. The SVM training entails minimization of $\frac{1}{n} \sum_{i=1}^{n} \ell(F(x_{i}), y_{i}) + \lambda \|F\|^{2}$ with respect to $\alpha_{i}$, and with $\ell(F(x), y) = 0$ if $|F(x) - y| \le \epsilon$ and $\ell(F(x), y) = |F(x) - y| - \epsilon$ otherwise. Prior to running the algorithm, the following parameters need to be fitted: $\epsilon$, which controls the resolution of the estimation, $\lambda$, which controls the smoothness of the solution, and the kernel parameters ($\gamma$ for the Gaussian kernel).
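The SVM regression ingredients recalled above can be sketched in a few lines. The following illustration implements a Gaussian kernel, the kernel expansion of F, and the loss that is zero inside a tube around the target. The symbol names (`alphas`, `bias`, `epsilon`, `gamma`) and the exact parameterization of the Gaussian kernel are assumptions, and the training step that fits the coefficients is not shown.

```python
import math


def gaussian_kernel(u, v, gamma):
    """Gaussian kernel exp(-gamma * ||u - v||^2) between two vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svr_predict(x, training_points, alphas, bias, gamma):
    """Kernel expansion F(x) = bias + sum_i alpha_i * K(x, x_i)."""
    return bias + sum(
        alpha * gaussian_kernel(x, xi, gamma)
        for alpha, xi in zip(alphas, training_points)
    )


def epsilon_insensitive_loss(prediction, target, epsilon):
    """Zero when |prediction - target| <= epsilon, linear beyond that."""
    return max(0.0, abs(prediction - target) - epsilon)
```

Errors smaller than `epsilon` are not penalized at all, which is why this parameter is described as controlling the resolution of the estimation.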
Joint work with: Douté, S. from Laboratoire de Planétologie de Grenoble, France, in the context of the VAHINE project (see Section ).
A new generation of imaging spectrometers is emerging with an angular dimension in addition to the three usual dimensions (two spatial dimensions and one spectral dimension). The surface of the planets will now be observed from different viewpoints on the satellite trajectory, corresponding to about ten different angles, instead of only one, usually corresponding to the vertical (0 degree angle) viewpoint. Multi-angle imaging spectrometers present several advantages: the influence of the atmosphere on the signal can be better identified and separated from the surface signal of interest, and the shape and size of the surface components and the surface granularity can be better characterized. However, this new generation of spectrometers also results in a significant increase in the size (several terabits expected) and complexity of the generated data. We started to investigate the use of statistical techniques to deal with these generic sources of complexity in data beyond the traditional tools in mainstream statistical packages.
Preliminary experiments carried out by Camille Neels during her two-month internship in the team showed that some preprocessing of the images is required prior to any classification task or other analysis. We identified in the data a so-called spectral smile issue, which we are currently trying to correct. Spectral smile refers to an artefact commonly encountered in spectral images acquired with push-broom spectrometers. It is due to the fact that the wavelength-channel association is not constant across the spatial dimension. Regarding classification tasks, it induces artificial inhomogeneities due to sampling issues.
In December 2006, we signed a three-year CIFRE contract with Xerox, Meylan, regarding the PhD work of Laurent Donini on statistical techniques for mining logs and usage data in a print infrastructure. The thesis is co-advised by Stéphane Girard and Jean-Baptiste Durand.
We developed a new collaboration with bioMerieux in Grenoble. We signed a six-month contract including the co-advising of Lamiae Azizi, who was at that time doing an internship at bioMerieux.
We signed a four-month contract with Veolia Eau in Lyon including the co-advising of Luce Ponsar, hired by Veolia for an internship. The goal was to study and possibly detect groups of individuals in time series describing various quantities linked to water consumption and billing in the Lyon area.
mistis participates in the weekly statistical seminar of Grenoble. F. Forbes is one of the organizers and several lecturers have been invited in this context.
mistis got Ministry grants for two projects supported by the French National Research Agency (ANR):
MDCO (Masse de Données et Connaissances) program. This three-year project is called "Visualisation et analyse d'images hyperspectrales multidimensionnelles en Astrophysique" (VAHINE). It aims at developing physical as well as mathematical models, algorithms, and software able to deal efficiently with hyperspectral multi-angle data but also with any other kind of large hyperspectral dataset (astronomical or experimental). It involves the Observatoire de la Côte d'Azur (Nice), and several universities (Strasbourg I and Grenoble I). For more information please visit the associated web site:
http://
VMC (Vulnérabilité : Milieux et climats) program. This three-year project is called "Forecast and projection in climate scenario of Mediterranean intense events: Uncertainties and Propagation on environment" (MEDUP) and deals with the quantification and identification of sources of uncertainties associated with the forecast and climate projection for Mediterranean high-impact weather events. The propagation of these uncertainties on the environment is also considered, as well as how they may combine with the intrinsic uncertainties of the vulnerability and risk analysis methods. It involves Météo-France and several universities (Paris VI, Grenoble I and Toulouse III). (
http://
mistis is also involved in two projects in the Cooperative Research Initiative (ARC) program supported by INRIA:
The ChromoNet project is coordinated by Marie-France Sagot from team HELIX. It aims at the computational inference and analysis of inter-chromosomal interaction networks. The additional partners are the SSB (Statistiques des Séquences Biologiques) group at INRA and the Nuclear Organisation team at MRC, Imperial College London.
The SeLMIC project (
http://
F. Forbes and S. Girard are members of the Pascal Network of Excellence.
S. Girard is a member of the European project (Interuniversity Attraction Pole network) “Statistical techniques and modelling for complex substantive questions with complex data”,
Web site :
http://
S. Girard also has joint work with Prof. A. Nazin (Institute of Control Science, Moscow, Russia).
mistis is involved in a European STREP project named POP (Perception On Purpose), coordinated by Radu Horaud from INRIA team Perception. The three-year project started in January 2006. Its objective is to put forward the modelling of perception (visual and auditory) as a complex attentional mechanism that embodies a decision-taking process. The task of the latter is to find a trade-off between the reliability of the sensorial stimuli (bottom-up attention) and the plausibility of prior knowledge (top-down attention). The mistis part, and in particular the PhD work of Vasil Kalidhov, is to contribute to the development of theoretical and algorithmic models based on probabilistic and statistical modelling of both the input and the processed data. Bayesian theory and hidden Markov models in particular will be combined with efficient optimization techniques in order to confront physical inputs and prior knowledge.
The final review of the project was held on December 11 and 12, 2008, with in particular a live demo running on the POP audiovisual head regarding multi-speaker localisation using binaural and binocular cues. Further details are available on the project web site
http://
S. Girard has joint work with M. El Aroui (ISG Tunis).
F. Forbes has joint work with C. Fraley and A. Raftery (Univ. of Washington, USA).
F. Forbes is a member of the group in charge of incentive initiatives (GTAI) in the Scientific and Technological Orientation Council (COST) of INRIA.
F. Forbes is part of an INRA (French National Institute for Agricultural Research) Network (MSTGA) on spatial statistics.
She is also part of an INRA committee (CSS MBIA) in charge of evaluating INRA researchers once a year.
S. Girard is a member of the committee in charge of examining applications to research scientist (CR) positions at INRIA.
F. Forbes and S. Girard are members of the committees (Commissions de Spécialistes) in charge of examining applications to Faculty member positions respectively at Institut Polytechnique de Grenoble (INPG) and at University Pierre Mendes France (UPMF, Grenoble II) and University Montpellier II.
F. Forbes was involved in the PhD committee of Benoit Scherrer from INSERM and Grenoble Institut des Neurosciences. The thesis title was "Segmentation des tissus et structures sur les IRM cerebrales: agents markoviens locaux cooperatifs et formulation Bayesienne" and the defence was held on December 12, 2008.
S. Girard was involved in the PhD committee of Sonia Hedli-Griche from University Grenoble II, "Estimation de l'opérateur de régression pour des données fonctionnelles et des erreurs corrélées" (January 2008), and of Matthieu Brucher from University Strasbourg I, "Représentations compactes et apprentissage non supervisé de variétés non linéaires. Applications au traitement d'image" (October 2008).
F. Forbes lectured a graduate course on the EM algorithm at Univ. J. Fourier, Grenoble I.
L. Gardes and M.J. Martinez are faculty members at Univ. P. Mendès-France.
L. Gardes and S. Girard lectured a graduate course on Extreme Value Analysis at Univ. J. Fourier, Grenoble I.
J.B. Durand is a faculty member at INPG, Grenoble.