The team mistis aims at developing statistical methods for dealing with complex problems or data. Our applications consist mainly of image processing and spatial data problems, with some applications in biology and medicine. Our approach is based on the statement that complexity can be handled by working up from simple local assumptions in a coherent way, defining a structured model; this is the key to modelling, computation, inference and interpretation. The methods we focus on involve mixture models, Markov models, and more generally hidden structure models identified by stochastic algorithms on the one hand, and semi- and non-parametric methods on the other hand.

Hidden structure models are useful for taking heterogeneity in data into account. They concern many areas of statistical methodology (finite mixture analysis, hidden Markov models, random effect models, ...). Due to their missing data structure, they induce specific difficulties for both estimating the model parameters and assessing performance. The team focuses on research regarding both aspects. We design specific algorithms for estimating the parameters of missing structure models, and we propose and study specific criteria for choosing the most relevant missing structure models in several contexts.

Semi- and non-parametric methods are relevant and useful when no appropriate parametric model exists for the data under study, either because of data complexity or because information is missing. The focus is on functions describing curves, surfaces or, more generally, manifolds, rather than on real-valued parameters. This can be interesting in image processing, for instance, where it can be difficult to introduce parametric models that are general enough (e.g. for contours).

In a first approach, we consider statistical parametric models, the parameter, possibly multi-dimensional, being usually unknown and to be estimated. We consider cases where the data naturally divide into observed data y = (y_{1}, ..., y_{n}) and unobserved or missing data z = (z_{1}, ..., z_{n}). The missing datum z_{i} represents, for instance, the membership of y_{i} to one of K alternative categories. The distribution of an observed y_{i} can then be written as a finite mixture of distributions:

P(y_{i}) = Σ_{k=1}^{K} P(z_{i} = k) P(y_{i} | z_{i} = k).

These models are interesting in that they may point out a hidden variable responsible for most of the observed variability, so that the observed variables are *conditionally* independent given it. Their estimation is often difficult due to the missing data. The Expectation-Maximization (EM) algorithm is a general and now standard approach to maximization of the likelihood in missing data problems. It provides parameter estimates as well as values for the missing data.

Mixture models correspond to independent z_{i}'s. They are increasingly used in statistical pattern recognition, as they allow a formal (model-based) approach to (unsupervised) clustering.

Graphical modelling provides a diagrammatic representation of the logical structure of a joint probability distribution, in the form of a network or graph depicting the local relations among variables. The graph can have directed or undirected links or edges between the nodes, which represent the individual variables. Associated with the graph are various Markov properties that specify how the graph encodes conditional independence assumptions.

It is the conditional independence assumptions that give the graphical models their fundamental modular structure, enabling computation of globally interesting quantities from local specifications. In this way graphical models form an essential basis for our methodologies based on structures.

The graphs can be either directed, e.g. Bayesian networks, or undirected, e.g. Markov random fields. The specificity of Markovian models is that the dependencies between the nodes are limited to the nearest neighbor nodes. The neighborhood definition can vary and be adapted to the problem of interest. When some of the variables (nodes) are not observed or missing, we refer to these models as hidden Markov models (HMM). Hidden Markov chains or hidden Markov fields correspond to cases where the z_{i}'s of the mixture model above are distributed according to a Markov chain or a Markov field. They are natural extensions of mixture models. They are widely used in signal processing (speech recognition, genome sequence analysis) and in image processing (remote sensing, MRI, etc.). Such models are very flexible in practice and can naturally account for the phenomena to be studied.
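
For hidden Markov chains, the posterior membership probabilities can be computed exactly by the classical forward-backward recursions. A minimal sketch (our own notation, not team software) is:

```python
import numpy as np

def forward_backward(obs_lik, A, init):
    """Posterior P(z_t = k | y_1..y_n) for a hidden Markov chain.
    obs_lik[t, k] = P(y_t | z_t = k); A[j, k] = P(z_{t+1} = k | z_t = j)."""
    n, K = obs_lik.shape
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    alpha[0] = init * obs_lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):                    # forward pass (normalized)
        alpha[t] = (alpha[t - 1] @ A) * obs_lik[t]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):           # backward pass (normalized)
        beta[t] = A @ (obs_lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# Two hidden states; observations favor state 0 then state 1
obs_lik = np.array([[0.9, 0.1]] * 5 + [[0.1, 0.9]] * 5)
A = np.array([[0.9, 0.1], [0.1, 0.9]])
post = forward_backward(obs_lik, A, np.array([0.5, 0.5]))
```

These posteriors are what the E-step of EM requires for hidden Markov chains; for hidden Markov fields the analogous computation is intractable, which motivates the approximations discussed below.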

They are very useful for modelling spatial dependencies, but these dependencies and the possible existence of hidden variables are also responsible for a typically large amount of computation. It follows that the statistical analysis may not be straightforward. Typical issues are related to the choice of the neighborhood structure, when it is not dictated by the context, and to the possible high dimensionality of the observations. This also requires a good understanding of the role of each parameter and methods to tune them depending on the goal in mind. As regards estimation, the algorithms correspond to an energy minimization problem which is NP-hard and is usually solved through approximation. We focus on methods based on the mean field principle and propose effective algorithms which show good performance in practice and whose theoretical properties we also study. We also propose tools for model selection. Finally, we investigate ways to extend the standard hidden Markov field model to increase its modelling power.
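
The mean field principle can be illustrated on a hidden Potts field: the intractable neighbor interaction at each pixel is replaced by the current posterior probabilities of its neighbors, yielding simple fixed-point updates. The sketch below is our own simplified version (assumed 4-neighborhood, fixed interaction parameter `beta`), not the team's algorithms:

```python
import numpy as np

def mean_field_segmentation(lik, beta=1.0, n_iter=20):
    """Mean-field approximation for a hidden Potts field on a 2-D lattice.
    lik[i, j, k] = P(y_ij | z_ij = k). The Potts neighbor term of each pixel
    is replaced by the sum of its 4 neighbors' current posteriors."""
    H, W, K = lik.shape
    q = lik / lik.sum(axis=2, keepdims=True)   # initialize with the data term
    for _ in range(n_iter):
        nb = np.zeros_like(q)                  # neighbor posteriors (zero padding)
        nb[1:, :] += q[:-1, :]
        nb[:-1, :] += q[1:, :]
        nb[:, 1:] += q[:, :-1]
        nb[:, :-1] += q[:, 1:]
        q = lik * np.exp(beta * nb)            # mean-field fixed-point update
        q /= q.sum(axis=2, keepdims=True)
    return q.argmax(axis=2)

# Noisy two-class image: left half class 0, right half class 1
rng = np.random.default_rng(0)
truth = np.zeros((20, 20), int)
truth[:, 10:] = 1
noisy = np.where(rng.random((20, 20)) < 0.8, truth, 1 - truth)
lik = np.where(noisy[:, :, None] == np.arange(2), 0.8, 0.2)
labels = mean_field_segmentation(lik, beta=1.0)
```

The spatial term smooths out most of the 20% label noise that a pixelwise maximum-likelihood rule would keep.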

We also consider methods which do not assume a parametric model. These approaches are non-parametric in the sense that they do not require the assumption of a prior model on the unknown quantities. This property is important since, for image applications for instance, it is very difficult to introduce sufficiently general parametric models because of the wide variety of image contents. As an illustration, the grey-level surface of an image cannot usually be described by a simple mathematical equation. Projection methods are then a way to decompose the unknown signal or image on a set of functions (*e.g.* wavelets). Kernel methods, which rely on smoothing the data using a set of kernels (usually probability distributions), are other examples. Relationships exist between these methods and learning techniques using Support Vector Machines (SVM), as appears in the context of *boundary estimation* and *image segmentation*. Such non-parametric methods have become the cornerstone when dealing with functional data. This is the case, for instance, when the observations are curves. They make it possible to model the data without a discretization step. More generally, these techniques are of great use for dimension reduction purposes: they reduce the dimension of functional or multivariate data without assumptions on the distribution of the observations. Semi-parametric methods refer to methods that include both parametric and non-parametric aspects. This is the case in *extreme value analysis*, which is based on the modelling of distribution tails. It differs from traditional statistics, which focuses on the central part of distributions, *i.e.* on the most probable events. Extreme value theory shows that distribution tails can be modelled by both a functional part and a real parameter, the extreme value index. As another example, relationships exist between multiresolution analysis and parametric Markov tree models.
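
As a generic illustration of kernel smoothing (a textbook Nadaraya-Watson sketch, not a method specific to the team), a curve can be estimated from noisy samples with no parametric model at all:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_eval, h=0.05):
    """Nadaraya-Watson kernel regression: smooth the data with Gaussian
    kernels of bandwidth h; no parametric form is assumed for the curve."""
    d = (x_eval[:, None] - x_train[None, :]) / h
    w = np.exp(-0.5 * d ** 2)                # Gaussian kernel weights
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

# Noisy samples of a curve with no simple parametric description assumed
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)
x_grid = np.linspace(0.1, 0.9, 50)
y_hat = nadaraya_watson(x, y, x_grid, h=0.05)
```

The bandwidth `h` plays the role of the smoothing parameter whose tuning is the central practical issue of such methods.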

Extreme value theory is a branch of statistics dealing with extreme deviations from the bulk of probability distributions. More specifically, it focuses on the limiting distributions of the minimum or the maximum of a large collection of random observations from the same arbitrary distribution. Let x_{1}, ..., x_{n} denote n ordered observations of a random variable X representing some quantity of interest. A p_{n}-quantile of X is the value q_{p_n} such that the probability that X exceeds q_{p_n} is p_{n}, *i.e.* P(X > q_{p_n}) = p_{n}. When p_{n} < 1/n, such a quantile is said to be extreme since it is usually greater than the maximum observation x_{n} (see Figure ). Estimating such quantiles therefore requires specific methods to extrapolate information beyond the observed values of X. These methods are based on extreme value theory. Issues of this kind first appeared in hydrology, where one objective was to assess the risk of highly unusual events, such as 100-year floods, starting from flows measured over 50 years. More generally, the problems that we address are part of risk management theory. For instance, in reliability, the distributions of interest belong to a semi-parametric family whose tails decrease exponentially fast. These so-called Weibull-tail distributions are defined by their survival distribution function:

P(X > x) = exp(-x^{1/θ} ℓ(x)), x > x_{0} > 0,

where both θ > 0 and the function ℓ(x) are unknown. Gaussian, gamma, exponential and Weibull distributions, among others, are included in this family. The function ℓ(x) acts as a nuisance parameter which induces a bias in the classical extreme-value estimators developed so far.
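
The extrapolation idea can be sketched in the heavy-tailed setting (rather than the Weibull-tail one discussed above) with the classical Hill and Weissman estimators; this is a generic textbook illustration, not the team's estimators:

```python
import numpy as np

def weissman_quantile(x, k, p):
    """Extrapolated extreme quantile for a heavy-tailed sample (Weissman).
    The Hill estimator of the extreme value index is computed from the k
    largest order statistics, then used to extrapolate beyond the sample."""
    xs = np.sort(x)
    n = len(xs)
    top = xs[n - k:]                                      # k largest observations
    gamma = np.mean(np.log(top) - np.log(xs[n - k - 1]))  # Hill estimator
    return xs[n - k - 1] * (k / (n * p)) ** gamma         # Weissman extrapolation

# Pareto sample with P(X > t) = t^{-2}: true extreme value index is 0.5
rng = np.random.default_rng(0)
x = rng.pareto(2.0, 5000) + 1.0
q_hat = weissman_quantile(x, k=200, p=1e-4)   # quantile of order beyond 1/n
q_true = (1e-4) ** (-1 / 2.0)                 # true value: 100
```

Note that the target quantile of order p = 10^{-4} lies well beyond the maximum of a sample of size 5000, which is exactly the situation where extrapolation is needed.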

Boundary estimation or, more generally, level set estimation is a recurrent problem in statistics and is linked to outlier detection. In biology, one is interested in estimating reference curves, that is to say, curves which bound 90% (for example) of the population. Points outside this bound are considered outliers with respect to the reference population. In image analysis, the boundary estimation problem arises in image segmentation as well as in supervised learning.

Our work on high-dimensional data includes non-parametric aspects. They are related to Principal Component Analysis (PCA), which is traditionally used to reduce dimension in data. However, standard linear PCA can be quite inefficient on image data, where even simple image distortions can lead to highly nonlinear data. When dealing with classification problems, our main project is then to adapt the nonlinear PCA method first introduced in Stéphane Girard's PhD thesis. This method relies on the approximation of datasets by manifolds, generalizing the PCA linear subspaces. This approach shows good performance when the data are images.

Our work also includes parametric approaches, in particular when considering classification and learning issues. In high-dimensional spaces, learning methods suffer from the curse of dimensionality: even for large datasets, large parts of the space are left empty. One of our approaches is therefore to develop new Gaussian models of high-dimensional data for parametric inference. Such models can then be used in a mixture or Markov framework for classification purposes.
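
A rough sketch of such a parsimonious high-dimensional Gaussian classifier (our own simplified variant, not the actual HDDA implementation) models each class in its own low-dimensional principal subspace, with a single pooled noise variance for the remaining directions:

```python
import numpy as np

class SubspaceGaussianClassifier:
    """Each class: mean + d-dimensional principal subspace with per-axis
    variances, plus one pooled variance for the p - d noise directions."""
    def __init__(self, d=2):
        self.d = d                             # assumed intrinsic dimension

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            vals, vecs = np.linalg.eigh(np.cov(Xc.T))
            order = np.argsort(vals)[::-1]
            a = vals[order][:self.d]           # signal variances
            b = vals[order][self.d:].mean()    # pooled noise variance
            V = vecs[:, order[:self.d]]        # subspace basis
            self.models_[c] = (mu, V, a, b, len(Xc) / len(X))
        return self

    def predict(self, X):
        p = X.shape[1]
        scores = []
        for c in self.classes_:
            mu, V, a, b, prior = self.models_[c]
            Z = (X - mu) @ V                   # coordinates in the subspace
            resid = ((X - mu - Z @ V.T) ** 2).sum(axis=1)
            logdet = np.log(a).sum() + (p - self.d) * np.log(b)
            maha = (Z ** 2 / a).sum(axis=1) + resid / b
            scores.append(np.log(prior) - 0.5 * (logdet + maha))
        return self.classes_[np.argmax(scores, axis=0)]

# Two classes in 20 dimensions, each with 2-D signal plus small noise
rng = np.random.default_rng(0)
scale = np.r_[3.0, 2.0, [0.3] * 18]
X0 = rng.normal(0, 1, (200, 20)) * scale
X1 = rng.normal(0, 1, (200, 20)) * scale + 1.5
X = np.vstack([X0, X1])
y = np.r_[np.zeros(200), np.ones(200)]
acc = (SubspaceGaussianClassifier(d=2).fit(X, y).predict(X) == y).mean()
```

The parsimony comes from estimating only d variances, a d-dimensional basis and one noise level per class instead of a full 20 x 20 covariance matrix.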

As regards applications, several areas of image analysis can be covered using the tools developed in the team. More specifically, in collaboration with team Lear, Inria Rhône-Alpes, we address issues of object and class recognition and of the extraction of visual information from large image databases.

Other applications in medical imaging are natural. We work more specifically on MRI data.

We also consider other statistical 2D fields coming from other domains such as remote sensing.

A second domain of applications concerns biomedical statistics and molecular biology. We consider the use of missing data models in population genetics. We also investigate statistical tools for the analysis of bacterial genomes beyond gene detection. Applications in agronomy are also considered.

Reliability and industrial lifetime analysis are applications developed through collaborations with the EDF research department and the LCFR laboratory of CEA / Cadarache. We also consider failure detection in print infrastructure through collaborations with Xerox, Meylan.

**HDDA Toolbox.** The High-Dimensional Discriminant Analysis (HDDA) toolbox contains efficient supervised classifiers for high-dimensional data. These classifiers are based on Gaussian models adapted to high-dimensional data. The HDDA toolbox is available for Matlab and will soon be included in the MixMod software. Version 1.1 of the HDDA Toolbox is now available.

**HDDC Toolbox.** The High-Dimensional Data Classification (HDDC) toolbox contains efficient unsupervised classifiers for high-dimensional data. These classifiers are also based on Gaussian models adapted to high-dimensional data. The HDDC toolbox is available for Matlab.

Both toolboxes are available at http://ace.acadiau.ca/math/bouveyron/softwares.html

Joint work with Jean Diebolt (CNRS), Myriam Garrido (INRA Clermont-Ferrand) and Jérôme Ecarnot.

The Extremes software is a toolbox dedicated to the modelling of extremal events, offering extreme quantile estimation procedures and model selection methods. This software results from a collaboration with EDF R&D and is also an outcome of the PhD thesis work of Myriam Garrido. The software is written in C++ with a Matlab graphical interface. It is now available for both Windows and Linux environments. It can be downloaded at the following URL: http://mistis.inrialpes.fr/software/EXTREMES/.

Recently, this software has been used to propose a new goodness-of-fit test for the distribution tail.

The SpaCEM^{3} (Spatial Clustering with EM and Markov Models) program replaces the former, still available, SEMMS (Spatial EM for Markovian Segmentation) program developed with Nathalie Peyrard from INRA Avignon.

SpaCEM^{3} proposes a variety of algorithms for image segmentation and for supervised and unsupervised classification of multidimensional, spatially located data. The main techniques use the EM algorithm for soft clustering and Markov random fields for spatial modelling. The learning and inference parts are based on recent developments in mean field approximations. The main functionalities of the program include:

The former SEMMS functionalities, *i.e.*:

Model based unsupervised image segmentation, including the following models: Hidden Markov Random Field and mixture model;

Model selection for the Hidden Markov Random Field model;

Simulation of commonly used Hidden Markov Random Field models (Potts models);

Addition of independent Gaussian noise for the simulation of noisy images.

And additional possibilities such as,

New Markov models, including various extensions of the Potts model and triplet Markov models;

Additional treatment of very high dimensional data using dimension reduction techniques within a classification framework;

Models and methods allowing supervised classification with new learning and test steps.

The SEMMS package, written in C, is publicly available at http://mistis.inrialpes.fr/software/SEMMS.html. SpaCEM^{3}, written in C++, is available at http://mistis.inrialpes.fr/software/SpaCEM3.tgz.

This is joint work with Olivier Francois (TimB, TIMC).

The FASTRUCT program is dedicated to the modelling and inference of population structure from genetic data. Bayesian model-based clustering programs have gained increasing popularity in studies of population structure since the publication of the software STRUCTURE. These programs are generally acknowledged as performing well, but their running time may be prohibitive. FASTRUCT is a non-Bayesian implementation of the classical model with no admixture and uncorrelated allele frequencies. This new program relies on the Expectation-Maximization principle and produces assignments rivaling those of other model-based clustering programs. In addition, it can be several-fold faster than Bayesian implementations. The software consists of a command-line engine, suitable for batch analysis of data, and an MS Windows graphical interface, convenient for exploring data.

It is written for Windows OS and contains a detailed user's guide. It is available at http://mistis.inrialpes.fr/realisations.html.

The functionalities are further described in the related publication in Molecular Ecology Notes, 2006.

Joint work with Serge Iovleff (Université Lille 3) and Cordelia Schmid (Lear, Inria).

In the PhD work of Charles Bouveyron (co-advised by Cordelia Schmid from the INRIA team LEAR), we propose new Gaussian models of high-dimensional data for classification purposes. We assume that the data live in several groups located in subspaces of lower dimension. Two different strategies arise:

the introduction in the model of a dimension reduction constraint for each group,

the use of parsimonious models obtained by requiring different groups to share the same values of some parameters.

This modelling yields new supervised classification methods called HDDA, for High Dimensional Discriminant Analysis. Some versions of this method have been tested on the supervised classification of objects in images. This approach has been adapted to the unsupervised classification framework, and the related method is named HDDC, for High Dimensional Data Clustering. In collaboration with Gilles Celeux and Charles Bouveyron, we are currently working on the automatic selection of the discrete parameters of the model. In the context of Juliette Blanchet's PhD work (also co-advised with C. Schmid), we have also combined the method with our Markov-model based approach to learning and classification, and obtained significant improvements in applications such as texture recognition, where the observations are high-dimensional.

We also aim to relax the Gaussian assumption. To this end, nonlinear models and semi-parametric methods are necessary.

This is joint work with Cordelia Schmid (LEAR, INRIA Rhône-Alpes).

**Supervised framework.** In this framework, small scale-invariant regions are detected on a learning image set and are then characterized by the local SIFT descriptor. The object is recognized in a test image if a sufficient number of matches with the learning set is found. The recognition step uses supervised classification methods. Frequently used methods are Linear Discriminant Analysis (LDA) and, more recently, kernel methods (SVM). In our approach, the object is represented as a set of object parts. For a motorbike, for example, we consider three parts: wheels, seat and handlebars.

The results obtained showed that the HDDA method described in Section gives better recognition results than SVM and other generative methods. In particular, the classification errors are significantly lower for HDDA than for SVM. In addition, the HDDA method is as fast as standard discriminant analysis (computation time of sec. for 1000 descriptors) and much faster than SVM ( sec.).

**Unsupervised framework.** Our approach automatically learns discriminant object parts and then identifies local descriptors belonging to the object. It first extracts a set of scale-invariant descriptors and then learns a set of discriminative object parts from a set of positive and negative images. Learning is "weakly supervised" since objects are not segmented in the positive images. Recognition matches descriptors of an unknown image to the discriminative object parts.

Object localization is a challenging problem since it requires a very precise classification of descriptors. For this, it is necessary to identify the descriptors of an image which have a high probability of belonging to the object. The adaptation of HDDA to the unsupervised framework, called HDDC, allows the computation, for each interest point, of the posterior probability that it belongs to the object. Finally, the object can be located in a test image by considering the points with the highest probabilities. In practice, 5 or 10 percent of all detected interest points are enough to locate the object efficiently. See an illustration in Figure .

We also consider the application of image classification. This step decides whether the object is present in the image, i.e. it classifies the image as positive (containing the object) or negative (not containing the object). We use our decision rule to assign a posterior probability to each descriptor and each cluster. We then decide, based on these probabilities, whether a test image contains the object. Previous approaches have used a simple empirical technique to classify a test image. We introduce a probabilistic technique which uses the posterior probabilities. We obtain, for a test image I, a score S ∈ [0, 1] that I contains the object. We decide that a test image contains the object if the score S is larger than a given threshold. This probabilistic decision has the advantage of not introducing an additional parameter and of using the posterior probability to reject (assign a low weight to) dubious points.

In this work, we focus on three sources of complexity. We consider data exhibiting (complex) dependence structures, having to do for example with spatial or temporal association, family relationship, and so on. More specifically, we consider observations associated with sites or items at spatial locations. These locations can be irregularly spaced. This goes beyond the standard regular lattice case traditionally used in image analysis and requires some adaptation.

A second source of complexity is connected with the measurement process, such as having multiple measuring instruments or computations generating high-dimensional data. There are not many 1-dimensional distributions for continuous variables that generalize to multidimensional ones, except products of 1-dimensional independent components. The Gaussian distribution is the most commonly used, but it has the specificity of being unimodal. The third source of complexity we consider is that, in real-world applications, data cannot usually be reduced to classes modeled by unimodal distributions, and consequently by single Gaussian distributions.

In this work, we consider supervised classification problems in which training sets are available, i.e. data whose exemplars have been grouped into classes.

We propose a unified Markovian framework for learning the class models and then classifying observed data into these classes. We show that models able to deal with the above sources of complexity can be derived from traditional tools such as mixture models and hidden Markov fields. For the latter, however, non-trivial extensions in the spirit of are required to include a learning step while preserving the Markovian modelling of the dependencies. Applications of our models include textured image segmentation. See an illustration in Figure .

Clustering of genes into groups sharing common characteristics is a useful exploratory technique for a number of subsequent computational and biological analyses. A wide range of clustering algorithms have been proposed, in particular to analyze gene expression data, but most of them consider genes as independent entities or include relevant information on gene interactions in a sub-optimal way.

We propose a probabilistic model that has the advantage of accounting simultaneously for individual data (*e.g.* expression) and pairwise data (*e.g.* interaction information coming from biological networks). Our model is based on hidden Markov random field models in which parametric probability distributions account for the distribution of individual data. Data on pairs, possibly reflecting distances or similarity measures between genes, are then included through a graph whose nodes represent the genes and whose edges are weighted according to the available interaction information. As a probabilistic model, this model has many interesting theoretical features. Preliminary experiments on simulated and real data also show promising results and point out the gain in using such an approach.

This is joint work with Benoit Scherrer, Michel Dojat and Christine Garbay from INSERM and LIG.

Accurate tissue and structure segmentation of MRI brain scans is critical for several applications. Markov random fields are commonly used for tissue segmentation to take into account spatial dependencies between voxels, hence acting as a labelling regularization. However, such a task requires the estimation of the model parameters (*e.g.* of a Potts model), which is not tractable without approximations. The algorithms in , based on EM and variational approximations, are considered. They show interesting results for tissue segmentation but are not sufficient for structure segmentation without introducing a priori anatomical knowledge. In most approaches, structure segmentation is performed after tissue segmentation. We suggest considering them as combined processes that cooperate. Brain anatomy is described by fuzzy spatial relations between structures that express general relative distances, orientations or symmetries. This knowledge is incorporated into a 2-class Markov model via an external field. This model is used for structure segmentation. The resulting structure information is then incorporated in turn into a 3- to 5-class Markov model for tissue segmentation via another specific external field. Tissue and structure segmentations thus appear as dynamical and cooperative MRF procedures whose performance increases gradually. This approach is implemented in a multi-agent framework, where autonomous entities, distributed over the image, estimate local Markov fields and cooperate to ensure consistency. We show, using phantoms and real images (acquired on a 3T scanner), that distributed and cooperative Markov modelling using anatomical knowledge is a powerful approach for MRI brain scan segmentation (see Figure ).

The current investigation concerns only one type (T1) of MR images with no temporal information. We are planning to extend our tools to include multidimensional MR sequences corresponding to other types of MR modalities and longitudinal data.

This is joint work with Olivier François from team TimB in TIMC laboratory.

In applications of population genetics, it is often useful to classify the individuals in a sample into populations, which then become the units of interest. However, the definition of populations is typically subjective, based, for example, on linguistic, cultural, or physical characters, as well as on the geographic location of sampled individuals. Recently, Pritchard et al. proposed a Bayesian approach to classify individuals into groups using genotype data. Such data, also called multilocus genotype data, consist of several genetic markers whose variations are measured at a series of loci for each sampled individual. Their method is based on a parametric model (model-based clustering) in which there are K groups (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Group allele frequencies are unknown and modeled by a Dirichlet distribution at each locus within each group. An MCMC algorithm is then used to estimate simultaneously the assignment probabilities and the allele frequencies for all groups. In such a model, individuals are assumed to be independent, which does not take into account their possible spatial proximity.

The main goal of this work is to introduce spatial prior models and to assess their role in accounting for the relationships between individuals. In this perspective, we propose to investigate particular Markov models on graphs and to evaluate the quality of mean field approximations for the estimation of their parameters.

Maximum likelihood estimation of such models in a spatial context is typically intractable, but mean field-like approximations within an EM algorithm framework, in the spirit of , will be considered to deal with this problem. This should result in a procedure alternative to MCMC approaches. With this in mind, we first considered the EM approach in a non-spatial case, as an alternative to the traditional Bayesian approaches. The corresponding new computer program (see Section ) and promising results are reported in .

This is joint work with Sylvain Douté and Etienne Deforas from Laboratoire de Planétologie de Grenoble, France.

Visible and near-infrared imaging spectroscopy is one of the key techniques for detecting, mapping and characterizing mineral and volatile (e.g. water-ice) species present at the surface of planets. Indeed, the chemical composition, granularity, texture, physical state, etc. of the materials determine the existence and morphology of absorption bands. The resulting spectra therefore contain very useful information. Current imaging spectrometers provide data organized as three-dimensional hyperspectral images: two spatial dimensions and one spectral dimension.

A new generation of imaging spectrometers is emerging with an additional angular dimension. The surface of the planets will now be observed from different viewpoints along the satellite trajectory, corresponding to about ten different angles, instead of the single vertical (0 degree angle) viewpoint used so far. Multi-angle imaging spectrometers present several advantages: the influence of the atmosphere on the signal can be better identified and separated from the surface signal of interest, and the shape, size and granularity of the surface components can be better characterized.

However, this new generation of spectrometers also results in a significant increase in the size (several terabits expected) and complexity of the generated data. Consequently, HMA (Hyperspectral Multi-Angular) data induce data manipulation and visualization problems due to their size and four-dimensional nature.

We propose to investigate the use of statistical techniques to deal with these generic sources of complexity in data beyond the traditional tools in mainstream statistical packages. Our goal is twofold:

We first focus on developing or adapting dimension reduction, classification and segmentation methods for informative and useful visualization and representation of the data prior to its subsequent analysis.

We also address the problem of physical model inversion, which is important for understanding the complex underlying physics of HMA signal formation. The models taking the angular dimension into account result in more complex treatments. We investigate the use of semi-parametric dimension reduction methods such as SIR (Sliced Inverse Regression) to perform model inversion in a reasonable computing time when the number of input observations increases considerably.
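
A basic SIR sketch (the generic method of Li, 1991, not the team's inversion code) illustrates how an effective dimension-reduction direction can be recovered from the between-slice covariance of the inverse regression curve:

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Sliced Inverse Regression: slice the data on y, average X within
    each slice, and extract the leading directions of the between-slice
    covariance after whitening by the covariance of X."""
    n, p = X.shape
    mu = X.mean(axis=0)
    sigma = np.cov(X.T)
    slices = np.array_split(np.argsort(y), n_slices)   # slice on the order of y
    M = np.zeros((p, p))
    for s in slices:
        m = X[s].mean(axis=0) - mu                     # slice mean of X
        M += (len(s) / n) * np.outer(m, m)
    svals, svecs = np.linalg.eigh(sigma)               # Sigma^{-1/2} whitening
    w = svecs @ np.diag(svals ** -0.5) @ svecs.T
    vals, vecs = np.linalg.eigh(w @ M @ w)
    dirs = w @ vecs[:, ::-1][:, :n_dirs]               # leading directions
    return dirs / np.linalg.norm(dirs, axis=0)

# Single-index model: y depends on X only through the direction b
rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.normal(size=(n, p))
b = np.zeros(p)
b[0] = 1.0
y = (X @ b) ** 3 + rng.normal(0, 0.5, n)
d_hat = sir_directions(X, y, n_slices=10, n_dirs=1)
```

The attraction for model inversion is that the regression of the inputs on the output is computed slice by slice, so the cost grows mildly with the number of observations.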

The first data set under consideration (hyperspectral images with vertical pointing) comes from the Mars-Express Mission operated by the European Space Agency. The second data set (multi-angular hyperspectral images) will be generated by the CRISM instrument of the Mars Reconnaissance Orbiter (NASA) that has started its scientific activities in June 2006 after orbit insertion. LPG is a co-investigator of the CRISM instrument.

This is joint work with Cécile Amblard (TimB in TIMC laboratory, Univ. Grenoble 1), Myriam Garrido (INRA Clermont-Ferrand), Armelle Guillou (Univ. Strasbourg), and Jean Diebolt (CNRS, Univ. Marne-la-vallée).

Our first achievement is the development of new estimators: kernel estimators and bias correction through exponential regression. Our second achievement is the construction of a goodness-of-fit test for the distribution tail. Usual tests are not adapted to this problem since they essentially check the fit to the central part of the distribution. Next, we aim at adapting extreme-value estimators to take covariate information into account. Such estimators would include extreme conditional quantile estimators, which are closely linked to the frontier estimators presented in Section . Finally, future work will include the study of multivariate extreme values. To this aim, research on some particular copulas has been initiated with Cécile Amblard, since copulas are the key tool for building multivariate distributions.

This is joint work with Anatoli Iouditski (Univ. Joseph Fourier, Grenoble), Guillaume Bouchard (Xerox, Meylan), Pierre Jacob and Ludovic Menneteau (Univ. Montpellier 2) and Alexandre Nazin (IPU, Moscow, Russia).

Two different and complementary approaches are developed.

Here, the boundary bounding the set of points is viewed as the largest level set of the points distribution. This is then an extreme quantile curve estimation problem. We propose estimators based on projection as well as on kernel regression methods applied to the set of extreme values, for particular sets of points. In this framework, we can obtain the asymptotic distribution of the error between the estimators and the true frontier. Our future work will be to define similar methods based on wavelet expansions, in order to estimate non-smooth boundaries, and on local polynomial estimators, to get rid of boundary effects. Besides, we are also working on the extension of our results to more general sets of points. This work was initiated in the PhD work of Laurent Gardes, co-directed by Pierre Jacob and Stéphane Girard, and continued with the consideration of star-shaped supports.
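A naive one-dimensional sketch of this level-set view (binned maxima smoothed by a Nadaraya-Watson kernel regression; an illustration of the principle only, not the projection or kernel estimators studied by the team, and all names and data are hypothetical):

```python
import numpy as np

def frontier_estimate(x, y, grid, n_bins=50, bandwidth=0.05):
    """Smooth the per-bin maxima of y to estimate the upper frontier."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    maxima = np.full(n_bins, np.nan)
    for j in range(n_bins):
        mask = (x >= edges[j]) & (x <= edges[j + 1])
        if mask.any():
            maxima[j] = y[mask].max()   # extreme value within the bin
    keep = ~np.isnan(maxima)
    c, m = centers[keep], maxima[keep]
    # Nadaraya-Watson smoothing of the bin maxima (Gaussian kernel).
    w = np.exp(-0.5 * ((grid[:, None] - c[None, :]) / bandwidth) ** 2)
    return (w * m).sum(axis=1) / w.sum(axis=1)

# Toy usage: points drawn uniformly under the frontier f(x) = 1 + sin(pi x).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 5000)
y = rng.uniform(0.0, 1.0 + np.sin(np.pi * x))
grid = np.linspace(0.05, 0.95, 10)
f_hat = frontier_estimate(x, y, grid)
```

The positive bias such a smoother exhibits near the edges of the support is an example of the boundary effects that local polynomial estimators are intended to remove.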

Here, the boundary of a set of points is defined as a closed curve bounding all the points and with the smallest associated surface. It is thus natural to reformulate the boundary estimation method as a linear programming problem. The resulting estimate is parsimonious: it relies only on a small number of points. This method belongs to the family of Support Vector Machine (SVM) techniques. Their finite-sample performances are very impressive, but their asymptotic properties are not well known, the difficulty being that there is no explicit formula for the estimator. However, such properties are of great interest, in particular to reduce the estimator bias. Two directions of research will be investigated. The first one consists in modifying the optimization problem itself. The second one is to use Jackknife-like methods, combining two biased estimators so that the two biases cancel out. One of the goals of our work is also to establish the rate of convergence of such methods in order to try to improve them.
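The Jackknife-like bias cancellation can be illustrated on a simple one-dimensional analogue (a toy example, not the SVM estimator itself): the sample maximum estimates the endpoint of a Uniform(0, theta) support with bias of order 1/n, and combining it with the average of the leave-one-out maxima cancels this first-order term, leaving a bias of order 1/n².

```python
import numpy as np

def jackknife_endpoint(sample):
    """Jackknife combination of the full-sample and leave-one-out maxima."""
    n = len(sample)
    x = np.sort(sample)
    theta_full = x[-1]
    # The leave-one-out maximum changes only when the largest point is
    # removed, so the average of the n estimates has a closed form.
    loo_mean = ((n - 1) * x[-1] + x[-2]) / n
    return n * theta_full - (n - 1) * loo_mean

# Monte Carlo comparison of the two biases on Uniform(0, 1) samples.
rng = np.random.default_rng(3)
n, theta, reps = 200, 1.0, 2000
naive, jack = [], []
for _ in range(reps):
    s = rng.uniform(0.0, theta, n)
    naive.append(s.max())
    jack.append(jackknife_endpoint(s))
bias_naive = np.mean(naive) - theta   # about -theta/(n+1)
bias_jack = np.mean(jack) - theta     # about -theta/(n^2+n), much smaller
```

The combined estimator trades a slightly larger variance for this bias reduction, which is the same trade-off at stake when combining two biased boundary estimators.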

This is joint work with Nadia Perot, Nicolas Devictor and Michel Marquès (CEA).

One of the main activities of the Laboratoire de Conduite et Fiabilité des Réacteurs (CEA Cadarache) concerns the probabilistic analysis of processes using reliability and statistical methods. In this context, probabilistic modelling of steel toughness in nuclear plant vessels has been developed. The databases under consideration include hundreds of data points indexed by temperature, so that reliable probabilistic models have been obtained for the central part of the distribution.

However, in this reliability problem, the key point is to investigate the behaviour of the model in the distribution tail. In particular, we are mainly interested in studying the lowest toughness values as the temperature varies. We are currently investigating the opportunity of proposing a postdoctoral position on this problem, supported by the CEA.

We signed in December 2006 a CIFRE contract with Xerox, Meylan, regarding the PhD work of Laurent Donini on statistical techniques for mining logs and usage data in a print infrastructure. The thesis will be co-advised by Stéphane Girard and Jean-Michel Renders (Xerox).

mistis participates in the weekly statistical seminar of Grenoble. F. Forbes is one of the organizers, and several lecturers have been invited in this context.

mistis obtained a Ministry grant (Action Concertée Incitative Masses de données) for a three-year project involving other partners (team Lear from INRIA, SMS from University Joseph Fourier, and Heudiasyc from UTC, Compiègne). The project, called Movistar, aims at investigating visual and statistical models for image recognition and description, and learning techniques for the management of large image databases.

Since July 2005, MISTIS has also been involved in the IBN (Integrated Biological Networks) project coordinated by Marie-France Sagot from the INRIA team HELIX. This project is part of the Cooperative Research Initiative (ARC) programme supported by INRIA. The other partners include two other INRIA teams (HELIX and SYMBIOSE), the Pasteur Institute, and INRA, Jouy-en-Josas.

J. Blanchet, C. Bouveyron, F. Forbes and S. Girard are members of the Pascal Network of Excellence.

S. Girard is a member of the European project (Interuniversity Attraction Pole network) ``Statistical techniques and modelling for complex substantive questions with complex data''.

Web site: http://www.stat.ucl.ac.be/IAP/frameiap.html.

S. Girard has also joint work with Prof. A. Nazin (Institute of Control Science, Moscow, Russia).

MISTIS is also involved in a European STREP project, named POP (Perception On Purpose), coordinated by Radu Horaud from the INRIA team MOVI. The three-year project started in January 2006. Its objective is to put forward the modelling of perception (visual and auditory) as a complex attentional mechanism that embodies a decision-making process. The task of the latter is to find a trade-off between the reliability of the sensorial stimuli (bottom-up attention) and the plausibility of prior knowledge (top-down attention). The contribution of MISTIS, and in particular the PhD work of Vasil Kalidhov, is to develop theoretical and algorithmic models based on probabilistic and statistical modelling of both the input and the processed data. Bayesian theory, and hidden Markov models in particular, will be combined with efficient optimization techniques in order to confront physical inputs with prior knowledge.

S. Girard has joint work with M. El Aroui (ISG Tunis).

F. Forbes has joint work with:

- C. Fraley (Univ. of Washington, USA)

- A. Raftery (Univ. of Washington, USA)

F. Forbes is member of the group in charge of incentive initiatives (GTAI) in the Scientific and Technological Orientation Council (COST) of INRIA.

S. Girard was involved in the PhD committee of Charles Bouveyron from University Joseph Fourier. Title of the thesis (in French): Modélisation et classification de données de grandes dimensions, application à l'analyse d'images (Modelling and classification of high-dimensional data, with application to image analysis).

He was also involved in the PhD committee of Aurélie Muller from University Montpellier 2. Title of the thesis (in French): Comportement asymptotique de la distribution des pluies extrêmes en France (Asymptotic behaviour of the distribution of extreme rainfall in France).

F. Forbes lectured a graduate course on the EM algorithm at Univ. J. Fourier, Grenoble.

L. Gardes is a faculty member at Univ. P. Mendès France, and Stéphane Girard was a faculty member at Univ. J. Fourier in Grenoble until June 2006.

H. Berthelon is a faculty member at CNAM, Paris.

Florence Forbes and Matthieu Vignes were invited to the 31st conference on Stochastic Processes and their Applications in Paris, France.

Stéphane Girard was an invited speaker at the workshop on Principal manifolds for data cartography and dimension reduction in Leicester, UK, in August 2006, and at the IMS annual meeting and X Brazilian School of Probability in Rio de Janeiro, Brazil, in July 2006.